[57] ABSTRACT

A system and a method for searching do not rely on previously
compiled vocabulary information or grammatical information
to perform a search. The search may accommodate new
words or phrases, and is performed in response to a
user's document search request. A unique character
string is extracted from an input document and a similarity
search is performed using the unique character string.
The extraction of the unique character string is performed by 
calculating and evaluating an amount of feature of a char- 
acter string through comparison between appearance fre- 
quency appearing in the input document and appearance 
frequency in a set of documents to be searched. Then, the 
extracted unique character string is used as the basis for the 
search. Documents found by the search are evaluated and 
arranged in the order of evaluation. The similarity factor of
a document is evaluated by using the appearance frequency of
each unique character string in the input document as a weight, so that
a higher evaluation is given to a document in which unique
character strings with higher weights appear many times.

19 Claims, 14 Drawing Sheets 
INFORMATION SEARCH METHOD,
INFORMATION SEARCH DEVICE, AND
STORAGE MEDIUM FOR STORING AN
INFORMATION SEARCH PROGRAM

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a system and method for
searching a large volume of documents stored in a computer.
More particularly, the invention relates to a system and
method for searching for stored documents that are similar
to a document having one or more particular string
characteristics at a high speed while allowing desired ambiguity in
the search of the stored documents.

2. Background Art

In known search methods for locating a document of
interest from a set of electronic document texts, it is common
practice to input a search expression in which character
strings indicative of the subject matter of interest are
connected by logical operators such as AND, OR, or NOT. An
example of a search expression follows.

(computer OR personal computer) AND search

This method gives all operation and discretion to the user in
converting the subject matter of interest into the search
expression. It is often troublesome to formulate suitable
search character strings and to input the suitable search
expression. In addition, the quality of the search results
greatly depends on the skill of the user in constructing the search
expression.

A method is also known where each search character
string of the search expression is weighted according to the
importance of the string. A result of such a search comprises
sequentially output documents corresponding to those
containing higher weight of the search character strings. An
example of a weighted search expression follows.

Computer, 60 personal computer, 60 search, 100

Like the first conventional method, this weighting method
also gives all operation and discretion for converting the
subject matter of interest to the search expression to the user.

The above methods further require the user to fully
understand the contents of documents to be searched and the terms
being used in the search expression. Therefore, if a user
vaguely thinks "I want to read sentences with such sense",
it is difficult to construct a meaningful search expression
from this general statement.

On the other hand, there is a technology as described in
Published Unexamined Patent Application (PUPA) No.
6-124305 to input natural language expressions for a
search, to extract search keys, and to perform the search
based on the extracted search keys. Such a search requires
a search key dictionary. In a method performing extraction
based on vocabulary information (word dictionary) such as
the search key dictionary or grammatical connection rules,
the word dictionary or the grammatical connection rules are
generally non-dynamic. Therefore, a new word such as
"TOYSARUS" ("TOYSARUS" is a trademark of Toysarus
Inc.) or a phrase such as "footprint of dinosaur" cannot be
extracted as a unique character string without great
difficulty. In addition, the concepts or perceived "features" of the
contents of a document may change over time. For example,
in the past, white-collar workers always wore suits when
they came to the office. More recently, there are many cases
where white-collar workers may not wear suits because
many companies have adopted a "casual day" system. To
keep pace with such changes, it is necessary to continuously
update the word dictionary to include new words, new
trademarks, trade names, and product names, etc. However,
such updating requires enormous labor, and the region for
storing the word dictionary or the like is increased as new
words are added. This, in turn, would adversely affect search
speed.

In addition, PUPA 6-223114 describes a method for
processing character strings based on the frequency of the
appearance of a word. However, such technology is used to
determine the type of a document, or to extract keywords for
a search by searching a registered word list (word
dictionary) to determine whether words in the document
exist in the word dictionary. Unlike these conventional
search methods, the present invention investigates the
frequency of the appearance of a word or character string in an
input document and a comparison document, and utilizes the
frequency of appearance in both of them. (In the word
list-based method, one word appears only once in the
dictionary, so that it is meaningless to investigate the
frequency of appearance in the list.) Accordingly, in PUPA
2-223114, a stationary word dictionary is still necessary, so
that there still remains the above-mentioned problem that a
new word or the like cannot be [...]
detects keywords based on stationary word dictionaries by
category, if there are multiple documents describing
"methods for searching documents" for example, there is a high
possibility that the keywords being extracted are very
similar ones such as "search", "character string", and "high
speed". Therefore, it is difficult to extract keywords for
differentiating each document, introducing inefficiencies in
the document search.

An object of the present invention is to build a search system
which enables the user to input a complicated search
concept with a very simple operation, such as the clicking of
a button, without requiring the user to think up or to input
a search expression.

Another object of the present invention is to provide a
search method which parallels a human thought process to
enable a user to easily input complicated and abstract
concepts for a search.

Another object of the present invention is to provide a
search method which reduces the labor of the user to think up or
to input a search character string or a search expression, to
provide a search method which can be easily used by
everybody, and enables a user to perform a search even if the
user does not accurately understand a keyword to be used for
the search.

Another object of the invention is to provide a search
method which relatively and dynamically extracts a unique
character string without using vocabulary information or
grammar information.

Another object of the present invention is to provide a
search method which requires less storage capacity and
extracts a unique character string at a high speed.

SUMMARY OF THE INVENTION

The present invention solves the above problems by 1)
extracting a unique character string (a character string
characterizing the sentence when viewing the character
string with respect to the entire set of documents) from an
input sentence; 2) allocating a suitable matching factor to
each unique character string for an ambiguity search; and 3)
evaluating the located documents of a search request by
using appearance frequency information in the input
sentence as a weight factor, and rearranging the located
documents in the order of evaluation.

"Input sentence" described herein means one or more
sentences in a language such as Japanese or English, and
may comprise the entire content of a sentence or a
paragraph. In addition, it may be a sentence in which multiple
languages such as Japanese and English are intermixed.

The "unique character string" means a character string 
characterizing the sentence when viewing the character 
string with respect to the entire set of documents, or
comparing it with another sentence. To provide an analogy, the
unique character string corresponds to a word or words in an
input sentence in the same way that a physical feature
corresponds to a person to distinguish the person from other 
individuals in a group. If most people in a group wear 
glasses, then the feature "wearing glasses" is not the feature 
of a specific person. On the other hand, if most people in a 
group are in casual clothes and a specific person wears a suit, 
then the feature "wearing a suit" is a significant feature of 
that person. 
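
The analogy can be made concrete with a small sketch. The following Python fragment is illustrative only and is not taken from the embodiment: the function names, the add-one smoothing, and the sample texts are assumptions. It compares how often a string appears in the input with how often it appears in the set of documents, so that a string that is common everywhere (like "wearing glasses") scores low and a string concentrated in the input (like "wearing a suit") scores high.

```python
def count_occurrences(text: str, s: str) -> int:
    """Count occurrences of s in text, allowing overlaps."""
    if not s:
        return 0
    count = 0
    start = text.find(s)
    while start != -1:
        count += 1
        start = text.find(s, start + 1)
    return count


def distinctiveness(s: str, input_text: str, documents: list[str]) -> float:
    """Ratio of the appearance rate in the input to the rate in the document set."""
    input_rate = count_occurrences(input_text, s) / max(len(input_text), 1)
    total_len = sum(len(d) for d in documents)
    total_hits = sum(count_occurrences(d, s) for d in documents)
    # Add-one smoothing is an assumption of this sketch; it avoids division
    # by zero for strings that never appear in the document set.
    collection_rate = (total_hits + 1) / (total_len + 1)
    return input_rate / collection_rate


input_text = "the dinosaur footprint and the dinosaur egg"
documents = ["the search of the documents", "the index of the files"]
common = distinctiveness("the", input_text, documents)
rare = distinctiveness("dinosaur", input_text, documents)
```

Here "the" appears in every document and therefore scores below 1, while "dinosaur" appears only in the input and scores above 1.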

A unique character string extracted from the input
sentence is weighted by the appearance frequency information
of the unique character string. An ambiguous search is made
based on the weighted unique character string. When a set of
sentences are previously arranged as an index file by
extracting appearance position information of N-character chains,
a high speed search can be performed with reference to
character strings in a document. Since extraction of words by
morpheme analysis is not performed, "dictionary
maintenance" is unnecessary, documents can be registered at a high
speed, and ambiguity search can be performed for searching
character strings in stored documents with a similar
arrangement of characters as the unique character string.
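
As a rough illustration of indexing by N-character chains, the sketch below records the position of every N-character chain in each document and then locates a query string by intersecting the recorded positions. It is a minimal sketch under stated assumptions: the in-memory dictionary stands in for the index file, N is fixed at 2, the function names are hypothetical, and only exact matching is shown (the ambiguity search is not reproduced here).

```python
from collections import defaultdict


def build_chain_index(documents, n=2):
    """Map each N-character chain to a list of (document number, position)."""
    index = defaultdict(list)
    for doc_no, text in enumerate(documents):
        for pos in range(len(text) - n + 1):
            index[text[pos:pos + n]].append((doc_no, pos))
    return index


def find_exact(index, query, n=2):
    """Return (document number, position) pairs where query occurs.

    Every chain of the query must appear at the expected offset, so the
    position lists are shifted back by the offset and intersected.
    """
    if len(query) < n:
        return []
    chains = [query[i:i + n] for i in range(len(query) - n + 1)]
    candidates = set(index.get(chains[0], []))
    for offset, chain in enumerate(chains[1:], start=1):
        shifted = {(d, p - offset) for d, p in index.get(chain, [])}
        candidates &= shifted
    return sorted(candidates)


index = build_chain_index(["search engine", "research"])
hits = find_exact(index, "search")
```

No word dictionary is consulted at any point: registering a document only requires sliding a window over its characters, which is why new words need no special handling in this kind of index.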

An aspect of the present invention provides a method for 
identifying a unique character string contained in an input 
document which is input into a computer system, said 
computer system being operable to search comparison docu- 
ments stored in a storage medium, the method comprising: 
associating and managing position information for a posi- 
tion in the comparison document where a partial comparison 
document character string extracted from the comparison 
document exists with the partial comparison document char- 
acter string; 

extracting a partial input character string from the input 
document, and determining whether the partial input character
string is a candidate character string; identifying a partial
comparison document character string which coincides with 
a part of the candidate character string at a predetermined 
similarity factor or higher; 

identifying position data associated with the partial com- 
parison document character string which coincides with the 
predetermined similarity factor; and 

recognizing the candidate character string as the unique 
character string by comparing appearance frequency infor- 
mation on a part of the candidate character string appearing 
in the input document with the position data, and evaluating 
an amount of feature of the candidate character string. When 
such a unique character string is recognized, the type of the 
document can be intuitively understood. 

"Comparison document stored in the storage medium" 
described herein includes not only a document stored in a 
storage device in the computer system, but also a document 
stored in another system but searchable by this computer. In 
addition, the comparison document may be a single 
document, or multiple documents, or parts of single or 
multiple documents (e.g., a title, a body excluding the title, 
a footnote or the like). Moreover, in the case of multiple 
documents, it may be a set of documents including the input 
document, or a set of documents extracted by search or the
like. The contents of a document may be written in a natural language
or a programming language.
In addition, "input document" includes not only an entire
document in a natural or program language stored in the 
storage device in the computer, but also a document which 
is stored in another system, but parts of which are extracted 
and input in this computer. In addition, the document may be 
a single document, or multiple documents, or parts of single 
or multiple documents. Moreover, the input document may 
be extracted parts of the comparison document. 

"A partial input character string being extracted from the 
input document" may be not only a fixed character string of 
N characters of a non-delimiter language (N is a natural 
number of 1 or more) or a variable character string of N or 
more characters of a non-delimiter language, but also one or 
more words of a delimiter language. The partial input 
character string may comprise keywords extracted from the
input document based on the vocabulary information (word 
dictionary) or the grammatical connection rules, but is not 
necessarily extracted from such information. 

"A part of the candidate character string" may include all 
of a candidate character string, while "appearance frequency 
information" means information relating to the number of
appearances of a part of the candidate character string in the 
input document, the comparison document or the like, and 
may be not only the number of appearances derived by 
investigating all of a document, but also information based 
on the number of appearances in a sample of each document.
In addition, as described with respect to a preferred embodi- 
ment of the present invention, appearance frequency infor- 
mation may be information which is a value relating to the 
number of appearances in the document, the value including 
a size of position information data (size of position infor- 
mation file shown in FIG. 3 in bytes) corresponding to the 
N-character chain in the candidate character string quantized 
to Q levels (Q is a constant), or a value which corresponds to
such value as compressed. 
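
The text does not state how the size of the position information data is quantized, so the following is only one possible reading. The sketch maps a size in bytes to one of Q integer levels on a logarithmic scale; the logarithmic scale, the default Q of 16, the upper bound `max_bytes`, and the function name are all assumptions made for illustration.

```python
import math


def quantize_size(size_bytes: int, q_levels: int = 16, max_bytes: int = 1 << 20) -> int:
    """Map a position-data size in bytes to an integer level in [0, q_levels - 1]."""
    if size_bytes <= 1:
        return 0
    clamped = min(size_bytes, max_bytes)
    # Logarithmic spacing (an assumption): frequent chains are compressed
    # into a few upper levels, rare chains keep finer resolution.
    level = int(math.log2(clamped) / math.log2(max_bytes) * (q_levels - 1))
    return min(level, q_levels - 1)
```

A value of this kind can stand in for an exact occurrence count, since a chain that appears more often has more position entries and therefore a larger position information size.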

"Associating and managing position information for a 
position in the comparison document with a partial com- 
parison document character string" is desirably management 
by a position information file as shown in FIG. 3 of the 
preferred embodiment of the present invention. However, 
management may be implemented by using a table or 
information for pointing to an information storage position 
as long as the position information in the comparison 
document is associated with the partial comparison docu- 
ment character string and managed. 

Although "an amount of feature" corresponds to points of 
the candidate character string as indicated in the preferred 
embodiment of the present invention, the concept is not 
limited to such. During calculation of points of a candidate 
character string, it may be possible to detect that the can- 
didate character string satisfies a criterion such that it is 
recognized as a unique character string, and to recognize the 
string as a unique character string based on this criterion. 
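
Once an amount of feature (points) has been calculated for each candidate, the recognition step reduces to applying a criterion. The sketch below is a hedged illustration, not the embodiment's procedure: it supports keeping the X highest-scoring candidates, keeping candidates whose amount of feature exceeds a threshold, or both together. The function name, parameter names, and sample scores are hypothetical.

```python
def select_unique_strings(feature_amounts, top_x=None, threshold=None):
    """Pick unique character strings from {candidate: amount of feature}.

    top_x keeps only the X highest-scoring candidates; threshold keeps only
    candidates whose amount of feature exceeds it. Both may be combined.
    """
    ranked = sorted(feature_amounts.items(), key=lambda kv: kv[1], reverse=True)
    if top_x is not None:
        ranked = ranked[:top_x]
    if threshold is not None:
        ranked = [(s, amount) for s, amount in ranked if amount > threshold]
    return [s for s, _ in ranked]


amounts = {"dinosaur": 2.3, "footprint": 1.8, "the": 0.4, "and": 0.3}
by_rank = select_unique_strings(amounts, top_x=2)
by_threshold = select_unique_strings(amounts, threshold=1.0)
combined = select_unique_strings(amounts, top_x=3, threshold=1.0)
```

With the sample scores, all three calls return "dinosaur" and "footprint", because the third-ranked candidate falls below the threshold.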

For "evaluating an amount of feature", it is possible to set 
various conditions such as a case where upper X character 
strings with a high amount of feature (becoming a feature of 
input document) are evaluated as the unique character string 
of the input document in question, a case where a candidate 
character string with an amount of feature exceeding a 
threshold is evaluated as the unique character string of the 
input document in question, or a case where a candidate 
character string including upper X character string and 
having an amount of feature exceeding a threshold is evalu- 
ated as the unique character string of the input document in 
question. The identified unique character string may be used 
for search as is, or further selected by other conditions (for 
example, adopted or rejected based on the degree of
overlapping in the input document).

Another aspect of the present invention provides a method
for searching for a comparison document, which has
character strings similar to a partial input character string
existing in an input document. The search is performed on a
plurality of documents to be searched which are searchably
stored in a computer, the method comprising: extracting a
partial character string from the input document, and
determining whether the partial character string is a candidate
character string;

evaluating an amount of feature of the candidate character
string through comparison between appearance frequency
information on a part of the candidate character string
appearing in the input document and appearance frequency
information on a part of the candidate character string
appearing in the comparison document to recognize the
candidate character string as a unique character string; and

searching a document to be searched having character
strings similar to the unique character string from the
plurality of documents to be searched.

"Character strings similar to the unique character string"
means character strings resembling the unique character
string with a predetermined similarity factor or higher,
including a character string with a similarity factor of 100%,
or complete matching. "Search" described herein includes
not only a search method for an ambiguity search described
in the preferred embodiment of the present invention, but
also any and all search methods which can search a
document from character strings.

"Comparison between appearance frequency information"
described herein is calculated based on "appearance
frequency information in input document/appearance
frequency information in comparison document" as the
simplest example, but may be replaced by various calculation
formulae as described in the preferred embodiment of the
invention.

Another aspect of the present invention provides a method
for identifying a unique character string contained in an
input document which is input into a computer system, said
computer system being able to search a comparison
document stored in a storage medium, the method comprising:
extracting a partial input character string from the input
document, and determining whether the partial input
character string is a candidate character string; and evaluating an
amount of feature of the candidate character string through
comparison between appearance frequency information on a
part of the candidate character string appearing in the input
document and appearance frequency information on a part
of the candidate character string appearing in the
comparison document to recognize the candidate character string as
a unique character string.

Another aspect of the present invention provides a method
for evaluating similarity between a comparison document
and an input document which contains a first unique
character string and a second unique character string input in a
computer system, said computer system being operable to
search a comparison document stored in a storage medium,
the method comprising the steps of: calculating a first weight
value corresponding to the first unique character string from
appearance frequency information on a part of the first
unique character string appearing in the input document;

calculating a second weight value corresponding to the
second unique character string from appearance frequency
information on a part of the second unique character string
appearing in the input document;

calculating a first appearance frequency value on a part of
the first unique character string appearing in the
comparison document;

calculating a second appearance frequency value on a part
of the second unique character string appearing in the
comparison document; and

calculating the similarity factor of the comparison
document from the first appearance frequency value taking the
first weight value into account and the second appearance
frequency value taking the second weight value into
account.

Another aspect of the present invention provides an
apparatus for evaluating similarity between a comparison
document and a unique character string input in a computer
system, said computer system being able to search a
comparison document stored in a storage medium, the apparatus
comprising:

means for calculating a weight value corresponding to the
unique character string from appearance frequency
information on a part of the unique character string appearing in the
input document; and

means for calculating the similarity factor of the
comparison document from the appearance frequency
information on a part of the unique character string appearing in the
comparison document.

Another aspect of the present invention provides an
apparatus for identifying a unique character string contained
in an input document which is input into a computer system,
said computer system containing a comparison document
searchably stored by the computer, the apparatus
comprising:

a storage device for storing a position information file
which associates and manages position information for a
position in the comparison document where a partial
comparison document character string extracted from the
comparison document exists with the partial comparison
document character string;

extracting means for extracting a candidate character
string from the input document;

means receiving an output of the extracting means, for
identifying a partial comparison document character string
which matches a part of the candidate character string with
a predetermined similarity factor or higher;

means for identifying position information which is
associated with the partial comparison document character string
with the predetermined similarity factor or higher in the
position information file; and

means for recognizing the candidate character string as
the unique character string by comparing appearance
frequency information on a part of the candidate character
string appearing in the input document with the position
information, and evaluating an amount of feature of the
candidate character string.

Another aspect of the present invention provides an
apparatus for searching a document to be searched, said
document having a character string similar to a partial input
character string which exists in an input document input in
a computer, said document being searched from a plurality
of documents which are searchably stored in the computer,
the apparatus comprising:

an input device for identifying the input document and for
instructing execution of search; means for detecting from the
input device the fact that the input document is identified and
that the instruction of search is input;

means for extracting a candidate character string from the
input document in response to the detection of the condition
that the input document is identified and that the instruction
of search is input; 

means for calculating an amount of feature through com- 
parison between appearance frequency information on a part 
of the candidate character string appearing in the input 
document and appearance frequency information on a part 
of the candidate character string appearing in the compari- 
son document; 

means for determining the candidate character string as a 
unique character string by evaluating the amount of feature; 
means for searching the document to be searched having a 
character string similar to the unique character string from a 
plurality of documents to be searched; and 

a display device for displaying the document to be 
searched having a character string similar to the unique 
character string. 

Another aspect of the present invention provides an 
apparatus for identifying a unique character string contained 
in an input document which is input into a computer system, 
said computer system containing a comparison document 
searchably stored by the computer, the apparatus compris- 
ing: 

means for extracting a candidate character string from the 
input document; and 

means for determining the candidate character string as a 
unique character string by evaluating an amount of feature 
of the candidate character string through comparison 
between appearance frequency information on a part of the 
candidate character string appearing in the input document 
and appearance frequency information on a part of the 
candidate character string appearing in the comparison 
document. 

Another aspect of the present invention provides an 
apparatus for evaluating similarity between a comparison 
document and an input document containing a unique char- 
acter string input into a computer system, said computer 
system containing a comparison document searchably stored 
by the computer, the apparatus comprising: 

means for calculating a weight value corresponding to the 
unique character string from appearance frequency informa- 
tion on a part of the unique character string appearing in the 
input document; and 

means for calculating a similarity factor of the compari- 
son document from appearance frequency information on a 
part of the unique character string appearing in the com- 
parison document and the weight value. 
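
The two means recited above can be pictured with a short sketch. It is an assumption-laden illustration rather than the claimed apparatus: the weight of each unique character string is taken to be its occurrence count in the input document, the similarity factor is taken to be the weighted sum of occurrence counts in the comparison document, and all function names and sample texts are hypothetical.

```python
def weights_from_input(input_text, unique_strings):
    """Weight each unique character string by its count in the input document."""
    return {s: input_text.count(s) for s in unique_strings}


def similarity_factor(document_text, weights):
    """Weighted sum of occurrence counts in the comparison document."""
    return sum(w * document_text.count(s) for s, w in weights.items())


def rank_documents(documents, weights):
    """Return document numbers ordered from highest to lowest similarity."""
    scored = [(similarity_factor(text, weights), no) for no, text in enumerate(documents)]
    return [no for score, no in sorted(scored, key=lambda t: (-t[0], t[1]))]


input_text = "dinosaur footprint dinosaur"
weights = weights_from_input(input_text, ["dinosaur", "footprint"])
documents = ["a footprint", "dinosaur and dinosaur footprint", "nothing here"]
order = rank_documents(documents, weights)
```

Because "dinosaur" appears twice in the input, it carries twice the weight of "footprint", so the second document, which repeats "dinosaur", is ranked first.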

Another aspect of the present invention provides a storage 
medium readable by a computer for storing a program which 
identifies an input document input into a computer system, 
said computer system containing a comparison document 
searchably stored by the computer, the program comprising: 

program code means for directing the computer to extract 
a partial character string from the input document and 
making the partial character string a candidate string; and 

program code means for directing the computer to deter- 
mine whether the candidate character string is a unique 
character string by evaluating an amount of feature of the 
candidate character string through comparison between 
appearance frequency information on a part of the candidate 
character string appearing in the input document and appear- 
ance frequency information on a part of the candidate 
character string appearing in the comparison document. 

The storage medium includes a floppy disk, a CD-ROM, 
an MO, a PD or a storage device connected to a network. 
The program code may be divided into a plurality of 
segments and stored in a plurality of media. In addition, the
program may be compressed and stored in a floppy disk. The 
medium is loaded on the system through various drives such 
as a floppy disk drive, a modem, or a serial port. 

Another aspect of the present invention provides a storage
medium readable by a computer for storing a program which
evaluates similarity between a comparison document and an
input document containing a unique character string input
into a computer system, said computer system containing a
comparison document searchably stored by the computer,
the program comprising:

program code means for directing the computer to
calculate a weight value corresponding to the unique character
string from appearance frequency information on a part of
the unique character string appearing in the input document;
and

program code means for directing the computer to
calculate a similarity factor of the comparison document from the
appearance frequency information on a part of the unique
character string appearing in the comparison document and
the weight value.

DESCRIPTION OF THE DRAWINGS 

A preferred embodiment of the present invention will be
explained with reference to the drawings, where:

FIG. 1 is a block diagram showing a hardware
configuration of a preferred embodiment of the present invention;

FIG. 2 is a block diagram of the system configuration
processing components for the hardware of FIG. 1;

FIG. 3 is a diagram showing the character chain file and
position information file structure of an index file for
searching documents;

FIG. 4 is a diagram showing the extended character chain
file and extended position information file structure of an
index file for searching documents;

FIGS. 5-6 illustrate a flowchart showing an index file
creation process;

FIG. 7 is a flowchart of a character search process using
the index file;

FIG. 8 is a flowchart of an ambiguity search process using
the index file;

FIG. 9 is a flowchart of an ambiguity search process using
the index file;

FIG. 10 is a flowchart showing extraction of a unique
character string from an input document;

FIGS. 11-15 are diagrams showing user interfaces in a
preferred embodiment of the present invention.

DESCRIPTION OF THE PREFERRED 
EMBODIMENT 
A. Hardware Configuration 

Referring to FIG. 1, there is shown a schematic view of
a system configuration for implementing the present
invention. In the configuration, a bus 101 is connected to a central
processing unit (CPU) 102 having capabilities of arithmetic
operation and input/output control, a main memory (RAM)
104 for loading a program and providing an operating
environment for the CPU 102, a keyboard 106 for key
inputting a command or a character string to be searched, a
hard disk 108 storing an operating system, a database file, a
search engine, an index file or the like, a display device 110
for displaying the result of search for the database, and a
pointing device (including a mouse or a track ball) 112 for
pointing to any location on the screen and transmitting its
position information.
Therefore, it can be easily understood that the present embodiment of the present invention, the created index file

invention can be implemented on an ordinary personal comprises four files including a character chain file, a

computer (PC), a workstation, or combination thereof. In position information file, an extended character chain file,

addition, there is shown a storage medium 114 for storing and an extended position information file,

program codes which provide instructions to the CPU or the Where the character chain file stores fixed length chains

like in cooperation with an operating system to implement and delimiter patterns, the position information file com-

the method of the present invention. The storage medium prises position numbers in the document and document

may include a floppy disk, a CD-ROM, an MO, a PD, and numbers corresponding thereto. The extended character

a storage device connected to a network. The program codes chain file stores extended fixed length chains and the

may be divided into a plurality of segments or compressed, extended position information file stores variable length

chain numbers and position numbers in the variable length

and stored in a plurality of media. The storage medium 114 chain. In the embodiment of the present invention,

is loaded on the system through various drives such as a search can be performed at a high speed by using such

floppy disk drive, a modem, or a serial port, whereby the search files. However, since appearance frequency of a

system shown in FIG. 1 is constituted as the system of the character string can be calculated regardless of the format

present invention. for storing the document, use of such a search file is not an

The operating system is desirably those supporting a GUI essential requirement of the present invention.

multiwindow environment as a standard feature such as In addition, the database 202 may store individual docu-

Windows (trademark of Microsoft), OS/2 (trademark of ments as separate files, or may sequentially arrange all

IBM) or X- Window system on AIX (trademark of IBM), but documents in a consecutive single file. In the former case, 

may be implemented on a character base environment such 20 the database 202 manages a table which causes the unique 

as PC-DOS (trademark of IBM) or MS-DOS (registered numbers for individual documents to correspond to actual 

trademark of Microsoft). The invention is not limited to a file names storing the documents. In the latter case, the 

specific operating system environment. In addition, while database 202 manages a table which causes the unique 

FIG. 1 shows a system in stand alone environment, because numbers for individual documents to correspond to offset in 

the database file generally requires a disk device with a large 25 the single database file and the size of document. In 

capacity, it may be possible to implement the present inven- summary, it is essential that individual documents are pro- 

tion as a client/server system in which the database file and vided with unique numbers, and the content of individual 

the search engine are placed on the server machine, and the documents can be accessed with such unique numbers, 

client machine is LAN connected to the server machine A search engine 208 has capabilities to search the index 

through Ethernet or token ring, and arranged to have only an 30 file 204 with a search character string given by a search 

input control function for identifying an input document and character input module 210 as an input, and to return a 

a display controller for viewing the result of search. document numbers) of the documents) containing the input 

B. System Configuration search character string and a positions) at which the input 

Referring to FIG. 2, a system configuration of the present search character string appears in the document. The search 

invention is described. It should be noted that components 35 character input module 210 preferably consists of a dialog 

represented by respective blocks in FIG. 2 are separately or box in the multiwindow environment, where the desired 

collectively stored as a data file or program file in the hard characters) to be searched is input through the keyboard 

disk 108 of FIG. 1. Database 202 stores a plurality of 106 and displayed in the input box. According to a feature 

documents such as a database of newspaper accounts, or a of the present invention, the search character input module 

database of patent publications. However, it should be noted 40 210 identifies an input document from which a unique 

the application of the present invention is not limited to character string is extracted. Specifically, titles of input 

search of a database consisting of a plurality of documents, documents are displayed on a display screen, and when the 

but includes searching in a single document. user selects a title with the pointer of the pointing device 112 

Contents of individual documents are searchably stored, such as a mouse, the system recognizes that a document 

for example, in a text file form. In addition, each document 45 corresponding to the displayed title is selected. The input 

is provided with a unique document number. While prefer- document may be specified by directly entering information 

able document numbers are ascending sequential numbers sufficient to identify the document to be input through the 

starting with 1, in the case of a patent publication database, keyboard 106. 

application numbers or laid-open numbers may be used as In addition, the search character string input module 210 

unique document numbers. In addition, symbols such as 50 can set the amount of feature of a unique character string 

"AIT or "&XYZ" may be used to identify individual (extracting a candidate character string as a unique character 

documents in place of the sequential number. However, string when the points of the candidate character string 

since such identification symbols generally require more exceeds a preset threshold) to be described later, or the 

bytes than numeric identifiers, it is preferable in practice to number of character strings in a unique character string 

identify documents with the sequential numbers. 55 (extracting candidate character strings as the unique char- 

A preferred embodiment of the present invention can acter string from those with higher point values), 

attain a high speed search for a document written in either Furthermore, the search character string input module 210 

a language which has a large number of characters, but does enables input of the similarity factor for an ambiguity search 

not have an explicit word delimiter such as Japanese and in a numerical value of 0-1 (which may be a value of 0-100 

Chinese (non-delimiter language), or a language which has 60 on the basis of percentage). Thus, the search character string 

relatively small number of characters, and is expressed with input module 210 displays a slider or scroll bar having a 

explicit delimiters such as English (delimiter language). handle which indicates any position between 0 and 1. The 

In general, since it takes a long processing time to directly handle of slider may indicate, for example, 1 as the default, 

search enormous contents such as news articles or patent and may be operated to indicate another value by dragging 

specifications stored in the database 202, an index file 204 65 and moving the handle with the mouse 112. 

is previously created by an index creation/update module A result display module 212 accesses the database 202 

206 for the contents of all news articles. According to an based on the document number as the result of search given 



10/30/2003, EAST Version: 1.4.1 



6,041,323 

11 12 

by the search engine 208 and the value of position at which the search character appears in the document, and displays a line corresponding to that position in the document. The line is preferably output in a separate display window. If the search result cannot be contained in one screen of the window, a scroll bar appears so that the user can sequentially view the search results by clicking the scroll bar. In the preferred embodiment of the present invention, the result display module 212 has a capability to display the extracted unique character string once on the display 110. The user may add, delete or modify the unique character string, change weighting on each unique character string, or set a condition such as AND or OR on the unique character string, and then may perform search by using the unique character string after such modification.

C. Operation

FIG. 10 shows the steps of the search method in the preferred embodiment of the present invention. First, the process starts with step 801. Then, a sentence is input (step 803). Then, a candidate character string is determined by extracting a partial character string which appears multiple times from this input sentence (step 805). Then, character strings which divide alphanumeric/katakana are eliminated from the candidate character strings (step 807). Subsequently, points (an amount of feature) are marked on the remaining candidate character strings to indicate how much each of the candidate character strings constitutes the feature of the input document (step 809). Then, the character strings are adopted or rejected as unique character strings based on their relationship to the candidate character strings and their point values (step 811). In addition, the unique character strings are determined, for example, in the descending order of point assignments (step 813). Then, the entire set of documents is searched by using the determined unique character strings as the search character strings (step 815). Then, the documents found by the search are evaluated (step 817), and the titles of documents or the like are output in the order of evaluation (step 819).

Prior to the detailed description of each process step in the preferred embodiment of the present invention, examples of operations available to the user will be described with reference to FIGS. 11 through 15. This example illustrates a search of a database containing a plurality of news articles. Here, the example uses a database of news articles of Nihon Keizai Shimbun for one year for which IBM ("IBM" is a trademark of IBM Corporation, U.S.A.) is granted a license for use of copyright from Nihon Keizai Shimbun-sha ("Nihon Keizai Shimbun" is a trademark of Nihon Keizai Shimbun-sha).

(1) The user inputs "Olympics" in an entry area 901 shown in FIG. 11 via the keyboard 106, and presses the Enter key, or clicks the button for Execute Search 931.

(2) The system detects the user input, and searches articles containing a character string "Olympics" by a conventional search process or the ambiguity search process to be described later.

(3) Then, the system outputs the result of search on the screen. Specifically, a list of titles 927 on various articles relating to Olympics such as Mathematics Olympics, a store called Olympic, or Nagano Olympics is output in a window 909 in the order of matching factor together with serial numbers 921, matching factor 923 and the date of articles 925. In this embodiment, a document with a high matching factor of 100 is selected, and data identifying that document is stored. In addition, the content of a document with the highest matching factor is caused to be displayed in a window 907, and its title or the like is displayed in a window 905. Also, the title or the like of the document currently displayed in the window 907 is highlighted and displayed in a window 909.

(4) The user causes the content of the document in the window 907 to change by clicking the title in the list of titles in the window 909 or the like. Then, after reading several articles, the user selects an article "Olympics Edition - Successful Olympic Winter Games, Appealing Environment Conscious Nation" from the window 909, and clicks the button. While the user wishes to read articles on Olympic Winter Games held several years ago, he or she is informed that the keyword for articles he or she wishes to read is "Lille Hammer" after viewing this article.

(5) The user clicks Search Similar button 947 to read articles resembling this article (it may be possible to perform the search again for the set of documents extracted in the search for Olympics by entering "Lille Hammer" in the entry 100).

(6) The system detects this input, and performs the search with the method according to the present invention using "Olympics Edition - Successful Olympic Winter Games, Appealing Environment Conscious Nation" as the input. Specifically, a unique character string is extracted from that article, and the similarity search is performed by using that unique character string.

(7) The system displays the result of the search on the screen. Specifically, the list of titles is sequentially output in the window 909 in descending order from the highest matching factor. While, in the embodiment, the content of the document with the highest matching factor is displayed in the window 907, and its document number, title or the like are displayed in the window 905, it may be possible to display the content of a document with the next highest matching factor in the window 907. The input title is displayed first because the document with the highest matching factor corresponds to the input document used for the search. In the window 907, a character string matching or similar to the unique character string in the content of the searched document is highlighted. Displayed in the window 903 is a title of the search for accessing the result of search. Here, information such as the serial number, the number of documents, and the searched titles is displayed. When the title of "Olympics" previously searched is clicked, information similar to that in FIG. 11 is displayed again in the windows 907 and 909.

(8) After the user reads several articles output as the result of search by scrolling the window 909, he or she may wish to read articles on snow-boarding. The user thus selects the article of "Issue on Olympics: Snow-board - Discussion on Adoption as Formal Event ..." in the window 909, and clicks the button.

(9) The user clicks Search Similar button 947 to read articles similar to that article on snow-boarding.

(10) The system detects this input, and performs again the search with the method according to the present invention using "Issue on Olympics: Snow-board - Discussion on Adoption as Formal Event ..." as the input.

(11) The system outputs the result of search as shown in FIG. 15 on the screen.

The user interface in the preferred embodiment of the present invention has various additional functions. For example, a pull-down menu 911 is one for inputting a search condition such as AND or OR, or selecting the number of documents set to be extracted as the set of search results, or an allowable similarity factor. A pull-down menu 913 is one for selecting a matching factor of a character string in performing the ambiguity search. In addition, a pull-down
menu 915 is one for selecting whether the subject of search is an entire document or a set of partial documents such as a set of searched documents. When the search is performed again for the set of searched documents, unique character strings are extracted by comparing the input document and a set of documents as the result of search limited to a category. Thus, it is possible to extract a character string which is a feature of the input document from a plurality of documents containing similar contents. In addition, the pull-down menu 915 enables selective searching for a limited part of a document such as searching for only titles, instead of the entire document. In this case, such a search can be attained by creating an index file only with titles, by embedding characters or symbols in the document for distinguishing the title and the body and detecting the embedded characters to exclude the body from the subject of the search, or by causing the title to exist at a fixed area in the document and performing the search only for that area. A pull-down menu 917 is one for selecting whether search is performed by differentiating between the upper and lower case of the alphabet.

A button 933 initializes the pull-down menus 911 to 917. For example, when the user has changed the pull-down menu 913 to 80%, and the pull-down menu 915 to the document set 1 (the set of documents as the result of search by the character string "Olympics"), the pull-down menus 913 and 915 are returned to the initial state of 100% and the entire document, respectively, by clicking the initialization button 933. A button 935 is a delete search result button. While the system stores information for identifying a document set as the result of search, clicking of this button causes the system to release the information on the document set inversely displayed in the window 903 at present, and to delete the titles of the document set from the window 903. A button 941 scrolls through the document so as to display a next unique character string (or matched character string, or similar character string) in the document. A button 943 is used to display a document with the next higher similarity factor, while a button 945 is to display a document with the next lower similarity factor.

In the sequence of steps, (1) through (4) are performed by a known search approach, or an approach for ambiguity search described later. Steps (5) through (11) in accordance with the present invention are described in the following section.

D. Extraction of Unique Character String

In the search method of the present invention, a unique character string is first extracted from an input document. The unique character string is extracted in accordance with two strategies: (1) extracting a character string with an appearance frequency in the input sentence that is relatively higher than that in the entire set of documents; (2) extracting a character string which is meaningful even if it is solely extracted.

For Strategy 1, the above-mentioned index file is used. The index file holds all N-character strings in the entire set of documents, and position information data of their appearance in a unique format. The quantity of position information data changes in a substantially proportional way with respect to the appearance frequency of corresponding N-character strings in the entire set of documents. The index file can be searched at a high speed in view of the unique structure of its index. During the search, the size of position information data is utilized as a value indicating the appearance frequency of an N-character string in the entire set of documents.

Detailed steps for extracting the unique character string are described below.

D1. Creation of Candidate Set for Unique Character String

A candidate set for the unique character string is formed in accordance with the following rules.

Extraction Rule 1: Partial character strings of N characters or more appearing in the input document two or more times are extracted, and added to a candidate set of the unique character string. Symbol characters such as " " and "(" are excluded during the extraction. N is the number of characters in the character chains held in the index file. In a delimiter language such as English, a word may be extracted as the partial character string. In the extraction, it is permissible that a group of words are converted to one conceptual word when the words have a common meaning. For example, "display", "display device", "CRT" or the like may be converted to "display device", and then the partial character strings are extracted. In addition, if desired, it may be possible to perform normalization such as conversion from upper case to lower case, conversion from double byte characters to single byte characters, conversion from the plural to the singular, or conversion of tense from the past or past perfect to the present. In addition, it is possible to perform effective extraction of the unique character string at a high speed by excluding character strings such as "a", "the", or "is" which generally do not constitute a feature of the document.

Exception Rule 2: When the partial character string extracted by the extraction rule 1 divides continuation of an alphanumeric into K characters or less at its start or end position, the partial character string is excluded from the candidate set for the unique character string. For example, if K is 4, it is possible to prevent "vision" from being extracted from "television".

The purpose of this step is to extract character strings which compose a significant character set based on the relationship among its characters. In other words, a character string is preserved if it is meaningful even if it is solely extracted as defined by Strategy 2. The exception rule 2 is based on experimental knowledge that when an alphanumeric string is too finely divided, it becomes meaningless. In addition, with the above rules it becomes possible to take into account different forms of English words.

D2. Assigning Points to Determine a Unique Character String Candidate

A higher point is accorded to a candidate unique character string (hereinafter "a candidate character string") with a relatively lower appearance frequency in the entire set of documents and higher appearance frequency in the input sentence. Thus, the simplest calculation formula is:

EQUATION 1

Amount of feature = Appearance frequency of candidate character string in input sentence / Appearance frequency of candidate character string in entire set of documents

In addition, when the number of characters in the input sentence and the entire set of documents is taken into account, it can be replaced by the following calculation formula:

EQUATION 2

Amount of feature = (Appearance frequency of candidate character string in input sentence * number of characters in entire set of documents) / (Appearance frequency of candidate character string in entire set of documents * number of characters in input sentence)

Since the position information file shown in FIG. 3 is used in the preferred embodiment of the present invention, points
can be assigned according to Equation (1) in such a manner that (1) a higher point is given to a candidate character string containing an N-character chain with less appearance frequency in the entire set of documents, but higher appearance frequency in the input sentence, and (2) a higher point is given to a candidate character string with a higher appearance frequency in the input sentence. Point assignment is performed according to Equation 3, with the variables described below.

count i: Appearance frequency of i-th candidate character string in input sentence

Ncount i j: Appearance frequency of j-th N-character chain of i-th candidate character string in input sentence (in the case of a delimiter language such as English, appearance frequency of a word in an input sentence)

Nsize i j: Value (number of bytes) of position information data corresponding to j-th N-character chain of i-th candidate character string (the number of position numbers in document of the position information file shown in FIG. 3 can be utilized as the appearance frequency information of N-character chain in the entire set of documents), quantized to Q levels (Q being a constant) (in the case of a delimiter language such as English, the position information of that word is used)

Nnum i: Number of N-character chains contained in i-th candidate character string.



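EQUATION 3

Point of i-th candidate character string (amount of feature i) =

$$\mathrm{score}_i = \frac{\sum_{j} \left( \mathrm{Ncount}_{i,j} / \mathrm{Nsize}_{i,j} \right)}{\mathrm{Nnum}_i} \times \mathrm{count}_i \times \frac{\max_{i,j}\left(\mathrm{Nsize}_{i,j}\right)}{\max_{i,j}\left(\mathrm{Ncount}_{i,j}\right)}$$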

This point assignment strategy corresponds to Strategy 1. As described above, the position information data is substituted for the appearance frequency of an N-character chain in the entire set of documents, and the quantization is to adjust for the difference of granularity between the units for Ncount and Nsize. The purpose of multiplication by Max (Nsize i j) and division by Max (Ncount i j) is to adjust for the difference between the total amount of the input sentence and that of the set of documents. For example, a lower point would be given to common character strings such as "the", "a", and "an".
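For comparison with the index-based scoring above, the simpler Equations 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the function names `feature_eq1` and `feature_eq2` and all of the counts are hypothetical and do not come from the specification.

```python
def feature_eq1(freq_input: int, freq_corpus: int) -> float:
    """Equation 1: input-sentence frequency divided by corpus frequency."""
    return freq_input / freq_corpus


def feature_eq2(freq_input: int, freq_corpus: int,
                chars_input: int, chars_corpus: int) -> float:
    """Equation 2: Equation 1 rescaled by the two text sizes (in characters)."""
    return (freq_input * chars_corpus) / (freq_corpus * chars_input)


# Hypothetical counts: both strings appear 3 times in a 1,000-character input,
# but one is rare and one is common in a 1,000,000-character document set.
rare = feature_eq2(3, 30, 1_000, 1_000_000)
common = feature_eq2(3, 30_000, 1_000, 1_000_000)
print(rare, common)
```

With these made-up numbers the rare string receives a much larger amount of feature than the common one, which is the behavior Strategy 1 asks for.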

The method for assigning points on the candidate character string may be modified by those skilled in the art. For example, the number of appearances may be weighted such that 1.5 is added for each appearance of a character string at a position in a document with higher importance such as a heading or title in the input document, while a smaller value of 0.5 is added to the number of appearances at a position in a document with less importance such as a footnote or a quotation.

Application of this equation is described on the following sample of English document. While an English example is set forth here, the equation may also be applied in searching Japanese language documents. An example of the application of the equation to a Japanese language text is set forth in JPA 8-095691.

The following example is based on one document in a database of news articles of Yomiuri Shimbun for which IBM is granted a license for use of copyright from Yomiuri Shimbun-sha ("Yomiuri Shimbun" is a trademark of Yomiuri Shimbun-sha). The values of Max (Ncount i j), Max (Nsize i j), Nsize 1,1 and Nsize 2,1 are based on the database contents.

Sample of English document 

Ranking Search and Fuzzy Operation 

Ranking Search returns a list of documents in the order of 
the score which is level of relevance to specified search 
condition. The maximum number of the returned documents 
is specified by the user program. The Ranking Search allows 
the user to start looking into documents with the most 
desirable one, which realizes efficient and effective search 
task. The following three factors are selectable among the 
factors to decide the score of documents: 

a. Frequency of search terms in the document As the 
search term appears more frequently in the document, 
the score of the document gets higher. 

b. Frequency of search terms in the whole set of docu- 
ments As the search term appears less frequently in the 
whole set of documents (all the documents indexed), 
the search term contributes to the score of the document 
more. 

c. Weight parameter specified explicitly by the user pro- 
gram As the weight of the search term is larger, the 
search term contributes to the score of the document 
more. 

The user program can specify which factors to use: one of 
them, two of them, or all of them. It is allowed to use none 
of them, and in that case the score is decided by whether the 
document contains the search term or not. Usually, speci- 
fying a and b is recommended. 

The way to choose the documents to be scored is select- 
able from the following two: 

Strict Boolean operation 

Scoring is done for the set of documents as a result of the 

traditional Boolean operation. 
Fuzzy operation 

Scoring is done for all the documents containing at least 
one search term. In this case, the operator is said to be 
a Fuzzy operator such as "Fuzzy AND". 

Fuzzy Operation 

By Fuzzy AND operation, for example, the result of "A 
AND B AND C" is evaluated higher in the following order: 
The document containing all of the three 
The document containing two of the three 
The document containing one of the three 
By Fuzzy NOT operation, for example, the result of "A NOT 
B" is evaluated higher in the following order: 

The document containing "A" and not containing "B" 
The document containing both "A" and "B" 
The traditional strict Boolean operation has an advantage in 
the speed, but it does not allow to evaluate the intermediate 
status. 

By using Fuzzy operation, the intermediate status, such as 
"The document contains not all search terms but almost all" 
is evaluated, therefore the result is natural to the way of 
human thinking. Fuzzy operation can be used in Ranking
Search only.

In the above example, the extracted character strings are
"fuzzy", "search", "document", "operation", "score",
"term", "contain", "rank", and "evaluate".
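The candidate extraction of section D1 can be sketched for a delimiter language as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the names `candidate_words`, `splits_alnum_run` and `STOP_WORDS` are hypothetical, the stop-word list is illustrative, and no stemming is performed (so "ranking" is not reduced to "rank" as in the example above). `splits_alnum_run` is one possible reading of Exception Rule 2 for character-level candidates.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not from the specification).
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "and", "or"}


def candidate_words(text: str, min_len: int = 2, min_count: int = 2) -> dict:
    """Extraction Rule 1 for a delimiter language: keep lower-cased words of at
    least min_len characters that occur min_count or more times in the input."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words
                     if len(w) >= min_len and w not in STOP_WORDS)
    return {w: c for w, c in counts.items() if c >= min_count}


def splits_alnum_run(text: str, start: int, end: int, k: int) -> bool:
    """Exception Rule 2 (one interpretation): True if text[start:end] cuts an
    alphanumeric run, leaving a fragment of 1..k characters before or after."""
    before = 0
    i = start - 1
    while text[start].isalnum() and i >= 0 and text[i].isalnum():
        before += 1
        i -= 1
    after = 0
    j = end
    while text[end - 1].isalnum() and j < len(text) and text[j].isalnum():
        after += 1
        j += 1
    return 0 < before <= k or 0 < after <= k


sample = ("Fuzzy operation can be used in Ranking Search only. "
          "By using Fuzzy operation, the intermediate status is evaluated.")
candidates = candidate_words(sample)
print(candidates)
print(splits_alnum_run("television", 4, 10, 4))  # "vision" cut from "tele|vision"
```

With K = 4 the helper flags "vision" inside "television", matching the example given for Exception Rule 2, while a free-standing "vision" is left alone.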

Here, if the first candidate character string is "fuzzy", and 
the second candidate character string is "and", the point 
allocation for each of these candidate strings can be deter- 
mined according to equation 3 as follows. 
Calculation of points for "fuzzy" is based on:

Max (Nsize i, j) = 390612
Nsize 1, 1 = 3028
Nsize 2, 1 = 169568
Ncount 1, 1 = 9
Nnum 1 = 1
count 1 = 9
Max (Ncount i, j) = 59

Thus, the point of a candidate string "fuzzy" is: score 1 = 9/3028/1 x 9 x 390612/59 = 177.10.

Calculation of points for "and" is based on Max (Ncount i, j), Max (Nsize i, j) and Nsize 2, 1 as set forth above, and

Ncount 2, 1 = 10
Nnum 2 = 1
count 2 = 10

Thus, the point of a candidate string "and" is: score 2 = 10/169568/1 x 10 x 390612/59 = 3.90.

Here, if a threshold value for selecting a string as a unique 
string is 150 points, then "fuzzy" is a unique string and 
"and" is not. 
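The two worked calculations above can be reproduced with a short Python sketch of Equation 3. The function name `candidate_points` and the list-based argument layout are assumptions made for illustration; the numeric values are those quoted in the example.

```python
def candidate_points(ncount, nsize, count, max_nsize, max_ncount):
    """Equation 3 for one candidate string.

    ncount[j] and nsize[j] are Ncount i,j and Nsize i,j for the j-th
    N-character chain (or word) of the candidate; Nnum i is their number.
    """
    nnum = len(ncount)
    mean_ratio = sum(c / s for c, s in zip(ncount, nsize)) / nnum
    return mean_ratio * count * max_nsize / max_ncount


MAX_NSIZE, MAX_NCOUNT = 390612, 59  # values quoted in the example
fuzzy = candidate_points([9], [3028], 9, MAX_NSIZE, MAX_NCOUNT)
and_ = candidate_points([10], [169568], 10, MAX_NSIZE, MAX_NCOUNT)

THRESHOLD = 150  # threshold used in the example
unique = [name for name, pts in (("fuzzy", fuzzy), ("and", and_))
          if pts >= THRESHOLD]
print(round(fuzzy, 2), round(and_, 2), unique)
```

Only "fuzzy" clears the 150-point threshold, in agreement with the text.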

D3. Adoption or Rejection of Candidate Character String 
Based on Degree of Overlapping 

Candidate character strings satisfying either of the fol-
lowing conditions may be excluded from the set of candidates.

Condition 1 A character at the second character or later is 
the first character of another candidate character string with 
higher point allocation. 

Condition 2 A character at the second to the last character 
or former is the last character of another candidate character 
string with higher point allocation. 

If the candidate character string to be excluded contains a character string of a length of N or longer not overlapping any other candidate character strings, it is marked with points according to the calculation step and is added to the set of candidates for the unique character string. The purpose of the above conditions is to eliminate a candidate character string having a weak relationship to characters appearing before or after the string.

D4. Determination of Unique Character String From Candidate Character Strings

A character string having points of X and Y or more is determined to be a unique character string. X and Y are constants.

E. Search with Unique Character String 

Documents containing the unique character string are searched from the entire set of documents. In the preferred embodiment of the present invention, documents containing a character string similar to the unique character string are searched from the entire set of documents with the ambiguity search. The details of the ambiguity search are described later. The search match factor for each unique character string (a search parameter indicating how much ambiguity is allowed) is suitably determined.

F. Output of Found Documents in the Order of Evaluation 
The found documents are evaluated and arranged in the 

order of evaluation. The similarity factor of a document is 
evaluated in such a manner that the number of appearances 
of each unique character string in the input sentence is used 
as weight. Additionally, a higher evaluation is provided for 
a document in which unique character strings with a higher 
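weight appear many times. In the simplest calculation formula, the above-mentioned number of appearances is used as is: if the weight of the k-th search character string (unique character string) is weight k, the similarity factor score(d) of document d is

EQUATION 4

$$\mathrm{score}(d) = \sum_{k} \bigl( \mathrm{weight}_k \times \text{appearance frequency of the } k\text{-th search character string} \bigr)$$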
In addition, such a similarity factor may be converted by various function equations so that it is dispersed between 0 and 1. Similarly, the weight value may also be dispersed between 0 and 1.
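A minimal Python sketch of the weighted sum of Equation 4, followed by ranking, is given below. It assumes that the appearance frequency in the formula is counted in the found document d; the names `score_eq4` and `rank_documents` and the sample counts are hypothetical.

```python
def score_eq4(weights: dict, doc_counts: dict) -> float:
    """Equation 4: sum over unique strings of weight k times the number of
    appearances of the k-th string (assumed to be counted in document d)."""
    return sum(w * doc_counts.get(s, 0) for s, w in weights.items())


def rank_documents(weights: dict, docs: dict) -> list:
    """Return (score, document number) pairs, highest score first."""
    scored = [(score_eq4(weights, counts), doc_id)
              for doc_id, counts in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Weights are the appearance counts of each unique string in the input
# sentence; the per-document counts below are made up for illustration.
weights = {"fuzzy": 9, "score": 4}
docs = {1: {"fuzzy": 2, "score": 1}, 2: {"score": 3}, 3: {}}
ranking = rank_documents(weights, docs)
print(ranking)
```

Document 1 ranks first here because the heavily weighted string "fuzzy" appears in it twice.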

In the preferred embodiment of the present invention, the similarity factor of a document is evaluated by taking the matching factor of the result of an ambiguity search into account, using the number of appearances of each unique character string in the input document, and using the equation below [Equation 5] so that a document in which many unique character strings with higher weight appear is provided with higher evaluation.

When the weight of the k-th search character string (unique character string) is represented as weight k, and the matching factor of the l-th hit (character string similar to the search character string) in a document of the number d as percent (d, k, l), the similarity factor is calculated as follows.

EQUATION 5

$$\mathrm{score1}(d) = \sum_{k} \Bigl( \mathrm{weight}_k \times \max_{l} \, \mathrm{percent}(d, k, l) \Bigr)$$

$$\mathrm{score2}(d) = \sum_{k} \Bigl( \mathrm{weight}_k \times \sum_{l} \mathrm{percent}(d, k, l) \Bigr)$$

$$\mathrm{score}(d) = \max_{x} \bigl( \mathrm{score2}(x) \bigr) \times \mathrm{score1}(d) + \mathrm{score2}(d)$$

The above [Equation 5] can be replaced by the following equation [Equation 6], where g (d) is a suitable function of the length of document d which increases with the length of document d in proportion to length/(length+C), C being a constant.

In addition, T, W, and S are suitable constants larger than 0 but less than 1.
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EQUATION 6 

/(<U)=£percent{<a,l) 

t{d, k) = 7+ (1 - J) * /(rf. k)f {f{d. k) + gid)) 

*<d, k) = weight kj £ weight k x ( W +■ (1 - W) * t{d, k)) 

scored) = 5 x MGn(w(rf, k)) + (1 - 5)x £ w(rf, k)(n(d) 



where: 

"f (d, k)" indicates appearance frequency for which the 
result of similarity search is taken into account. 
65 "t (d, k) M is to normalize the "f (d, k)" to 0-1 considering 
of the length of the document; and 
"w (d, k)" is added with a weight value. 
G. Structure of the Index File and How to Create It

In the present invention, files are created which index a
series of characters belonging to a character set C (variable
length chains), all continuous N characters not belonging to
the character set C (fixed length chains), their positions in
the document, and division information in the document.
FIG. 3 illustrates a character chain file 302 and a position
information file 304. Here, the "character set C" means a
predetermined set of characters, preferably alphabetic (i.e.,
'A'-'Z' and 'a'-'z'). However, it is contemplated that the
set C may include characters used in languages such as
German, French, Italian, or Russian, may be alphanumeric, may
consist of single-byte or double-byte characters, and may
include several symbol characters such as "?". Moreover, the
"division information in document" typically means a
delimiter in a sentence such as "." or ",", and, in a broader
sense, a delimiter in a document such as "Chapter 1",
"Summary", a blank line, or blank character(s).

For the variable length chains, files are created which index
all continuous N' characters in all variable length chains
(extended fixed length chains) and their positions in the
variable length chains, together with the variable length
chains. FIG. 4 illustrates an extended character chain file
306 and an extended position information file 308.

The four files (the character chain file 302, the position
information file 304, the extended character chain file 306,
and the extended position information file 308) are not
necessarily physically different files. It is sufficient that
the information be stored in such a manner that the content
controlled by each file can be logically processed.

G1. Normalization of Character String

In a preferred embodiment of the present invention,
character strings are normalized prior to creating the index
file. For example, when the document to be searched is a
Japanese document file, single-byte and double-byte
characters may be intermixed. Normalization processing is
performed to replace single-byte characters with the
corresponding double-byte characters (or vice versa), and
lowercase letters with uppercase letters (or vice versa). The
normalization of character strings is not an essential
component of the present invention. The details of
normalization may be changed to normalize the string
according to a user's specifications. It may be desired to
perform normalization by conversion between the plural and
the singular, conversion of tense from the past or past
perfect tense to the present tense, etc.

G2. Extraction of Fixed Length Chain Information

The next step for creating the index file is to extract, for
all characters to be searched in the normalized character
string that do not belong to the character set C, the
continuous N characters starting from each of these
characters (hereinafter called "fixed length chains"), and to
store them in the index file together with the document
number and the position number in the document. As the value
of N, N ≥ 1 (specifically, N = 2) is suitable for Japanese,
Chinese and Korean.

In order to reduce the size of the index file, it is
desirable not to search a series consisting of a character
belonging to the character set C and an adjacent blank
character.

G3. Extraction of Variable Length Chain Information

The next step to create the index file is to extract
continuations of characters in the normalized character
string that belong to the character set C (variable length
chains), and to store them in the index file together with
the document number and the position number in the document.
As set forth above, the character set C need not be
alphabetic. In this case, the character string may contain a
plurality of variable length chains in which character
strings are continuous. While the character string may
include a plurality of variable length chains, a variable
length chain can still be extracted by using a blank, line
feed, "!", or "?" as a delimiter.

For example, in the case of "Boys be ambitious." or "Boys
(line feed) be ambitious.", three variable length chains of
"Boys", "be", and "ambitious" can be extracted. Additionally,
in the case of a "line feed" following "-" with no blank
before or after it, the continuations of characters before
and after the "-" may be determined to comprise a single
variable length chain. Accordingly, even in a case such as
"Boys be ambi-(line feed)tious", three variable length
chains of "Boys", "be", and "ambitious" are extracted.

G4. Position Information

In the preferred embodiment of the present invention,
every individual document is divided into blocks in a
manner such that they are meaningful for search. Division
information is stored in the index file. The document may be
divided into blocks by detecting a line feed, a period,
punctuation, "Chapter X", or "Section X", by detecting a
blank line, or by detecting the paragraph number in a patent
specification, or a certain number of characters may be
treated as one block. These blocks are assigned a series
of numbers, or block numbers. A specially defined delimiter
pattern is stored together with the document number of the
document and the position information in the document for
the characters at the boundaries of blocks.

Several different division methods can be obtained by
defining several types of delimiter patterns. However, a
delimiter pattern should be defined so as not to overlap any
character chain extracted from a normalized character string.
Since one-byte codes may be converted into two-byte codes
through the normalization, if two bytes are assumed to be
one word, a word whose value is 255 or less does not
correspond to a normal character code. Any word value
between 0 and 255 can therefore be individually assigned to
the several types of delimiter patterns.

The advantages of storing the division information in a
format similar to that of the character chain are as follows:

It is easy to create and update the index. No particular
processing is required for the division information; and

The capacity of the index is not significantly increased.
The increase in capacity is very small when compared
with a format which appends the corresponding block number
to every position number in the document.

The position number in a document is a unique sequential
number in the document block and is assigned to all
characters to be searched in the document. The position
number in the document of the first character in a character
chain is taken to be the position number in the document of
that character chain. If a fixed length chain is shorter than
N characters at the end of a continuation of characters not
belonging to the character set C together with subsequent
characters, predefined padding characters such as X'00' are
added to make the number of characters N.

G5. Extraction of Extended Fixed Length Chain Information

The next step to create the index file is to extract, for all
characters in all variable length chains, the continuous N'
characters starting from each of these characters
(hereinafter called "extended fixed length chains"), and to
store them in the index file together with the extended
character chain number and the position number in the
extended character chain. Here, N' ≥ 1 (specifically,
N' = 3) is suitable if the character set C is alphabetic.
The search speed can be improved when an extended fixed
length chain is extracted after appending a start mark and an
end mark before and after a variable length chain,
respectively. For example, when "$" and "¥" are used as the
start and end marks, respectively, the extended fixed length
chains "$ca", "cat", "at¥", and "t¥" are extracted from the
variable length chain "cat". It then becomes possible to
keep "communication" or the like from being mixed in as noise
when determining a match for "$ca" or the like.

G6. Example of Position Number in Document 

For example, it is assumed that a document containing a 
sentence "data base system-123" at the beginning is con- 
tained in the database 202 (FIG. 2). If blank characters 
adjacent to the characters belonging to the character set C 
are not searched, and assuming that the above character set 
C is alphabetic, then when a position number in a document 
is appended to each character in this sentence, they become 
as follows. 

TABLE 1

Position number in document    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
for characters
Normalized character string    d  a  t  a  b  a  s  e  s  y  s  t  e  m  -  1  2  3  .
Delimiter method 1



Then, it is assumed that the document number for that 
document is 1, and that the number of characters N for the 
fixed length chain is 2. Then, individual fixed length chains 
(length 2), delimiter patterns and document numbers asso- 
ciated thereto, and position numbers in the document are as 
follows. 



TABLE 2

Fixed length chain      Document number    Position number in document
-1                      1                  15
12                      1                  16
23                      1                  17
3.                      1                  18
.                       1                  19
Delimiter pattern 1     1                  19
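The extraction of fixed length chains for this example can be sketched in Python. The sketch makes several assumptions that are not stated in the specification: the character set C is taken to be the ASCII letters, blanks adjacent to a letter receive no position number, each chain is cut from within one run of non-C characters, and a short final chain is padded with X'00'. The names `number_characters` and `fixed_length_chains` are hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)  # assumed character set C
PAD = "\x00"                         # padding character X'00'


def number_characters(text):
    """Assign sequential position numbers, skipping blanks next to a letter."""
    numbered = []
    position = 0
    for i, ch in enumerate(text):
        if ch == " ":
            prev_is_c = i > 0 and text[i - 1] in LETTERS
            next_is_c = i + 1 < len(text) and text[i + 1] in LETTERS
            if prev_is_c or next_is_c:
                continue  # this blank is not searched, so it gets no number
        position += 1
        numbered.append((position, ch))
    return numbered


def fixed_length_chains(text, n=2):
    """Return (chain, position number) for every character not in set C."""
    numbered = number_characters(text)
    chains = []
    i = 0
    while i < len(numbered):
        if numbered[i][1] in LETTERS:
            i += 1
            continue
        # Collect one run of consecutive characters outside set C
        j = i
        while j < len(numbered) and numbered[j][1] not in LETTERS:
            j += 1
        run = numbered[i:j]
        for k, (position, _) in enumerate(run):
            chunk = "".join(ch for _, ch in run[k:k + n])
            chains.append((chunk.ljust(n, PAD), position))
        i = j
    return chains


chains = fixed_length_chains("data base system-123.")
```

For the sentence of Table 1 with its trailing period, this yields chains at position numbers 15 through 19, the last one being the period followed by one padding character.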


Individual variable length chains, document numbers
associated thereto, and position numbers in the document are
as follows.

TABLE 3

Variable length chain   Document number    Position number in document
data                    1                  1
base                    1                  5
system                  1                  9
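The variable length chains and their position numbers can be derived with a short Python sketch. It assumes, as in the example, that the character set C is the ASCII letters and that a blank adjacent to a letter is not counted when numbering positions; the function name `variable_length_chains` and the regular expressions are illustrative only.

```python
import re

LETTER_RUN = re.compile(r"[A-Za-z]+")
# A blank directly after or directly before a letter is not searched
SKIPPED_BLANK = re.compile(r"(?<=[A-Za-z]) | (?=[A-Za-z])")


def variable_length_chains(text):
    """Return (chain, position number in document) for each run of letters."""
    skipped = [m.start() for m in SKIPPED_BLANK.finditer(text)]
    chains = []
    for m in LETTER_RUN.finditer(text):
        # Subtract the uncounted blanks that precede this chain
        blanks_before = sum(1 for s in skipped if s < m.start())
        chains.append((m.group(), m.start() - blanks_before + 1))
    return chains


table3 = variable_length_chains("data base system-123.")
```

The resulting pairs correspond to the rows of Table 3 for document number 1.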



Then, assuming that the numbers appended to the
variable length chains are sequentially 1, 2, and 3, and that
the number of characters N' of the extended fixed length
chain is 3, the individual extended fixed length chains
(length 3), the variable length chain numbers associated
thereto, and the position numbers in the variable length
chain are as follows.

TABLE 4

Extended fixed    Variable length    Position number in
length chain      chain number       variable length chain
dat               1                  1
ata               1                  2
ta                1                  3
a                 1                  4
bas               2                  1
ase               2                  2
se                2                  3
e                 2                  4
sys               3                  1
yst               3                  2
ste               3                  3
tem               3                  4
em                3                  5
m                 3                  6
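A minimal Python sketch of this extraction is given below, together with a grouping step that attaches several (variable length chain number, position number) pairs to one extended fixed length chain. The names `extended_fixed_length_chains` and `group_by_chain` are hypothetical, and the short chains at the end of a word are left unpadded here, which is an assumption of the sketch.

```python
from collections import defaultdict


def extended_fixed_length_chains(variable_chains, n=3):
    """Return (chain, variable length chain number, position in chain)."""
    rows = []
    for chain_number, chain in enumerate(variable_chains, start=1):
        for start in range(len(chain)):
            # Up to n characters starting at each character of the chain
            rows.append((chain[start:start + n], chain_number, start + 1))
    return rows


def group_by_chain(rows):
    """Put identical extended fixed length chains together into one entry."""
    grouped = defaultdict(list)
    for chain, chain_number, position in rows:
        grouped[chain].append((chain_number, position))
    return dict(grouped)


rows = extended_fixed_length_chains(["data", "base", "system"])
```

Grouping matters when the same three-character chain occurs in several words, since a single entry then carries all of its occurrences.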



When it is permissible to append a plurality of variable
length chain numbers and position numbers in a variable
length chain to one extended fixed length chain, the whole
capacity can be compressed, and high efficiency can be
obtained particularly for a document with many duplicated
words. In addition, since duplicated searches can be
eliminated by putting such duplicated character strings
together, a high speed search can be performed.
G7. Role of Division Information in Document 

Now, the usefulness of the division information (delimiter)
in a document to be searched is described.

Search only for a specific block

When a document is composed of the title, the abstract,
and the body, for example, it may be common to perform
searches only for a specific portion such as the title and/or
the abstract. Such a search can be attained by storing
delimiter patterns and their position information for the end
of the title and the end of the abstract.

Search for document with strong association between a
plurality of character strings

It may be a common demand to perform searches with an
awareness of the strength of association between a plurality
of character strings depending on the context. For example,
it is believed that character strings are more likely to be
strongly associated when they are in the same paragraph than
when they are merely in the same document, and that the
association is stronger still if they are in the same
sentence. By storing delimiter patterns and their position
information for the ends of paragraphs or sentences, it
becomes possible to search for a document in which a
plurality of character strings are in the same block, so that
a search with awareness of the strength of association can be
performed.
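Both uses of the division information reduce to finding the block that contains a given position number. The Python sketch below assumes that the position numbers of one delimiter pattern in a document are available as a sorted list and that each delimiter marks the end of a block; the function names are hypothetical.

```python
from bisect import bisect_left


def block_number(position, delimiter_positions):
    """Return the 1-based block containing a position number.

    delimiter_positions must be sorted; a character located exactly at a
    delimiter position still belongs to the block that the delimiter ends.
    """
    # Number of delimiters strictly before the position, plus one
    return bisect_left(delimiter_positions, position) + 1


def in_same_block(position_a, position_b, delimiter_positions):
    """True when two hits fall in the same sentence, paragraph, etc."""
    return (block_number(position_a, delimiter_positions)
            == block_number(position_b, delimiter_positions))


sentence_ends = [19, 40]
```

With sentence delimiters at position numbers 19 and 40, hits at 5 and 12 share the first block, while hits at 5 and 25 do not.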
G8. Structure of Index File 

It is necessary to store the character chain, the delimiter
pattern, the document number, and the position number in the
document in such a manner that they can be efficiently
extracted during searching. Thus, in this embodiment, as
shown in FIGS. 3 and 4, the index file is composed of four
files: the character chain file 302 (a file mainly storing
the fixed length chain, the variable length chain, and the
delimiter pattern), the position information file 304 (a file
mainly storing the document number and the position number in
the document), the extended character chain file 306 (a file
mainly storing the extended fixed length chain), and the
extended position information file 308 (a file mainly storing
the variable length chain number and the position number in
the variable length chain). The character chain file 302 is
arranged to store information on where the fixed length
chain, the variable length chain, the delimiter pattern, and
the document number 312 and the position number in
document 314 associated with them are positioned in the
position information file 304. The position information file
304 is arranged to store the document number 312 and the
position number in the document 314.

The extended character chain file 306 is arranged to store 
information on where the extended fixed length chain, and 
the variable length chain number 316 and the position 
number in variable length chain 318 associated to it are 
positioned in the extended position information file 308. The 
extended position information file 308 is arranged to store 
the variable length chain number 316 and the position 
number in the variable length chain 318. 

While the embodiment is described for a document in
which the delimiter language and the non-delimiter language
are intermixed, those skilled in the art would easily
understand that the present invention can be applied to a
document only of the delimiter language or a document only
of the non-delimiter language. In the case of only the
non-delimiter language, since there is generally no need to
take the variable length chain into account, the extended
character chain file 306 and the extended position
information file 308 are not required. However, variable
length chains may exist even in a non-delimiter language
document if the document consists of keywords extracted from
the abstract, as in a patent specification.

In FIG. 3, the entries in the character chain file 302 are
the fixed length chains, the variable length chains and the
delimiter patterns in all documents in the database 202. The
entries in the character chain file 302 are preferably sorted
in ascending order of the code values of the normalized
character chains so as to enable dichotomized (binary)
searching. "Delimiter pattern 1", "-1", "12" and the like are
individual entries in the character chain file 302. Here,
"delimiter pattern 1", for example, collectively indicates
delimiters of a sentence or phrase such as "," or ".", and is
assigned a special two-byte value.

The position information file 304 of FIG. 3 stores at least 
one document number 312 corresponding to individual 
entries in the character chain file 302, and at least one 
position number in document 314 associated to each of the 
individual document numbers. 

To cause the entries in the character chain file 302 to
correspond to those in the position information file 304,
each entry in the character chain file 302 has offset
information (not shown), measured from the beginning of the
position information file 304, for the corresponding entries
in the position information file 304, and information on the
size of those entries. For example, for "delimiter pattern
1", a seek is performed on the position information file 304
from its beginning based on the offset stored in the
character chain file 302, and the number of bytes specified
in the size information is read from the sought position.
This makes it possible to collectively read, with respect to
"delimiter pattern 1", the position number values of 16, 19,
... in the document number 1, the position number values in
the document relating to the document number 2, ..., and the
position number values in the document relating to the
document number n, if any. In addition, by storing
information indicating the ranges where the fixed length
chains, the variable length chains and the delimiter patterns
are stored, it is possible to determine whether information
stored in the character chain file 302 belongs to the fixed
length chains, the variable length chains or the delimiter
patterns.

Generally, the position number values in the document
relating to the document number i are stored, for example,
in the form of (document number i: 4 bytes), (number k of
position numbers in document: 4 bytes), (first position
number in document: 4 bytes), ... (k-th position number in
document: 4 bytes). Although, in this example, 4 bytes are
used to store the absolute position in the document in the
field for storing the position number in the document, it is
desirable in practice to store the offset from the previous
position number in the document so that the number of bytes
is reduced to 1-3 bytes. It is also desirable to reduce the
file capacity by performing compression through coding. This
may be applied to the fields for storing the document number
and the position number in the document.
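The relationship between the two files can be sketched in Python as a sorted list of (chain, offset, size) entries and a byte string of 4-byte fields. This is an illustrative layout only: the little-endian byte order, the in-memory byte string standing in for the position information file 304, and the names `build_index` and `lookup` are assumptions of the sketch, and the offset-based compression mentioned above is omitted.

```python
import struct
from bisect import bisect_left


def build_index(postings):
    """postings: {chain: {document number: [position numbers]}}.

    Returns (entries, blob): entries is a sorted list of
    (chain, offset, size) and blob holds the position records.
    """
    entries = []
    blob = bytearray()
    for chain in sorted(postings):  # ascending order of code values
        offset = len(blob)
        for doc, positions in sorted(postings[chain].items()):
            # (document number)(count k)(k position numbers), 4 bytes each
            blob += struct.pack(f"<II{len(positions)}I",
                                doc, len(positions), *positions)
        entries.append((chain, offset, len(blob) - offset))
    return entries, bytes(blob)


def lookup(entries, blob, chain):
    """Binary-search the chain, then read its records via offset and size."""
    keys = [entry[0] for entry in entries]
    i = bisect_left(keys, chain)
    if i == len(keys) or keys[i] != chain:
        return {}
    _, offset, size = entries[i]
    end = offset + size
    result = {}
    while offset < end:
        doc, count = struct.unpack_from("<II", blob, offset)
        offset += 8
        result[doc] = list(struct.unpack_from(f"<{count}I", blob, offset))
        offset += 4 * count
    return result


entries, blob = build_index({"-1": {1: [15]}, "data": {1: [1], 2: [7, 30]}})
```

Because every entry records both an offset and a size, all documents and positions for one chain are read in a single contiguous access.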

In FIGS. 3 and 4, the entries in the extended character chain
file 306 are the extended fixed length chains in all variable
length chains in the character chain file 302. The entries in
the extended character chain file 306 are preferably sorted
in ascending order of the code values of the normalized
character chains so as to enable dichotomized (binary)
searching. "dat", "ata" and the like are individual entries
of the extended character chain file 306.

The extended position information file 308 of FIG. 4
stores at least one variable length chain number
corresponding to the individual entries in the extended
character chain file 306, and at least one position number in
the variable length chain associated with each of the
individual variable length chain numbers.
G9. Process for Creating Index File 

Now, referring to FIG. 5, the process for creating the
index file will be described. This process is performed by
the index creation/update module 206 of FIG. 2 when initially
building the database 202, or when adding or deleting a
document to or from the database 202.

In FIG. 5, step 402 secures a memory region. This is a
process to obtain a work area with a predetermined size on
the RAM 104 by calling a function of the operating system.

In step 404, one document is read from the database 202
to the memory region obtained in step 402. In step 406, the
above-mentioned normalization is performed for the document
read in step 404. In step 408, fixed length chains,
variable length chains, and delimiter patterns are created by
scanning the normalized document. Then, the fixed length
chains, the variable length chains, the delimiter patterns,
the document number of the document, and the position
numbers in the document of the fixed length chains, the
variable length chains and the delimiter patterns are stored
in the memory region obtained in step 402.

In step 408, as the fixed length chains, the variable length
chains, the delimiter patterns, the document number and the
position numbers in the document are stored in the
memory region obtained in step 402, the available space in
that memory region may become exhausted. Therefore, in step
410, a check is performed as to whether or not the obtained
memory region is full. If so, in step 412, the fixed length
chains, the variable length chains, and the delimiter
patterns stored in the memory region, together with the
document number of the document and the position information
in the document of the fixed length chains, the variable
length chains and the delimiter patterns, are sorted based
on, for example, the character code values of the fixed
length chains, the variable length chains and the delimiter
patterns, the document number, and the position numbers in
the document. The sorted data is written to the disk 108
(FIG. 1) as an intermediate file, whereby the memory region
holding the data written to the intermediate file is released
for use in the subsequent process. Then, the process proceeds
to step 414. On the other hand, if it is determined in step
410 that space still remains in the memory region, the
process immediately proceeds to step 414.

In step 414, it is determined whether there remain docu- 
ments not yet read in step 404 in the database 202. If so, the 
process returns to step 404. 

If it is determined in step 414 that all documents in the
database 202 have been completely read, then, in step 416,
the fixed length chains, the variable length chains and the
delimiter patterns which remain in the memory region obtained
in step 402 and have not yet been written out, together with
the document number of the document and the position numbers
in the document of the fixed length chains, the variable
length chains and the delimiter patterns, are also sorted
based on the character code values of the fixed length
chains, the variable length chains and the delimiter
patterns, the document number, and the position numbers in
the document, and written to the disk 108 (FIG. 1) as an
intermediate file.

Since the writing of the intermediate files in steps 412 and
416 causes a plurality of intermediate files to exist on the
disk 108, and each of these intermediate files is already
sorted, step 418 creates the character chain file 302 and the
position information file 304 shown in FIG. 3 from these
intermediate files with conventional, known merge/sort
techniques, and stores them on the disk 108. In addition,
since the same character chain may appear several times in
the original intermediate files, a process is performed here
to put the duplicated entries of the same character chain
into one, and to associate the related document numbers and
position numbers in the document with that character chain.
Thereafter, the intermediate files are no longer necessary
and are deleted.
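The flow of steps 404 through 418 can be illustrated with a compact Python sketch. It is a simplification under stated assumptions: records are (chain, document number, position number) tuples, sorted in-memory lists stand in for the intermediate files on the disk 108, the memory region is modeled as a record count `capacity`, and the function names are hypothetical.

```python
import heapq
from itertools import groupby


def sorted_runs(records, capacity):
    """Yield sorted runs, each standing in for one intermediate file."""
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= capacity:   # memory region is full (step 410)
            yield sorted(buffer)      # sort and write out (step 412)
            buffer = []
    if buffer:
        yield sorted(buffer)          # remaining records (step 416)


def merge_runs(runs):
    """Merge the sorted runs and fold duplicate chains into one entry."""
    merged = heapq.merge(*runs)       # merge of already sorted inputs
    return {chain: [(doc, pos) for _, doc, pos in group]
            for chain, group in groupby(merged, key=lambda r: r[0])}


records = [("data", 1, 1), ("base", 1, 5), ("data", 2, 3),
           ("-1", 1, 15), ("base", 2, 9)]
index = merge_runs(list(sorted_runs(records, capacity=2)))
```

Each chain ends up with a single entry that lists every document number and position number collected for it, in sorted order.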

Referring to FIG. 6, in step 420, one variable length chain 
is read from the character chain file 302 into the memory 
region preferably obtained in step 402. In the preferred 
embodiment of the present invention, since the storage 
position of the variable length chain in the character chain 
file 302 is stored in the character chain file 302 at the time 
it is created, it is possible to immediately access the top 
position of the variable length chain in the character chain 
file 302. 

In step 422, the extended fixed length chain is created by 
scanning the variable length chain. Then, the extended fixed 
length chain, the variable length chain number of that 
variable length chain, the position number in the variable 
length chain of the extended fixed length chain are stored in 
the memory region obtained in step 402. 

In step 422, as the extended fixed length chain, the
variable length chain number and the position number in the
variable length chain are stored in the memory region
obtained in step 402, the available space in that memory
region may become exhausted. Therefore, in step 424, a check
is performed as to whether or not the obtained memory region
is full. If so, in step 426, the extended fixed length
chains, the variable length chain numbers, and the position
information in the variable length chain stored in the memory
region are sorted based on, for example, the character code
values of the extended fixed length chains, the variable
length chain number and the position number in the variable
length chain. The sorted data is written to the disk 108
(FIG. 1) as an intermediate file, whereby the memory region
holding the data written to the intermediate file is released
for use in the subsequent process. Then, the process proceeds
to step 428. On the other hand, if it is determined in step
424 that space still remains in the memory region, the
process immediately proceeds to step 428.

In step 428, it is determined whether there remain
variable length chains not yet read in step 420 in the
character chain file 302. If so, the process returns to step
420. On the other hand, if it is determined in step 428 that
all variable length chains in the character chain file 302
have been completely read, then, in step 430, the extended
fixed length chains which remain in the memory region
obtained in step 402 and have not yet been written out,
together with the variable length chain number and the
position number in the variable length chain, are also sorted
based on the character code values of the extended fixed
length chains, the variable length chain number and the
position number in the variable length chain, and written to
the disk 108 (FIG. 1) as an intermediate file.

Since the writing of the intermediate files in steps 426 and
430 causes a plurality of intermediate files to exist on the
disk 108, and each of these intermediate files is already
sorted, step 432 creates the extended character chain file
306 and the extended position information file 308 shown in
FIG. 4 from these intermediate files with conventional, known
merge/sort techniques, and stores them on the disk 108. In
addition, since the same character chain may appear several
times in the original intermediate files, a process is
performed here to put the duplicated entries of the same
character chain into one, and to associate the related
variable length chain numbers and position numbers in the
variable length chain with that character chain. Thereafter,
the intermediate files are no longer necessary and are
deleted.

H. Search Process by Using Index File 

Now, an example of a process for performing a character
string search by using the index file created as above will be
described by referring to the flowchart of FIG. 7. In step 502,
first, a dialogue box with an input box is displayed to
prompt the user to input a search character string in the
input box. When the user inputs the search character string
in the input box and clicks the OK button or presses a return
key, the search character string is normalized, if desired.
In step 504, fixed length chains of N characters and variable
length chains are created from the search character string
based on the same rules as those used when the index file
was created.

In step 506, the fixed length chain is searched for in the
character chain file. In step 508, if it is determined that no
fixed length chain is found, a message box is preferably
displayed in step 526 indicating that the search character
string cannot be found, and the process ends.

In step 508, if it is determined that a fixed length chain is
found, the position information file returns one or more
document numbers and at least one position number in the
document for each document number, and this information is
temporarily stored in step 510 in a predetermined buffer
region in the main memory or on the disk for the subsequent
process.

In step 512, it is determined whether all fixed length
chains created from the search character string have been
searched. If so, the process proceeds to step 514. If not, the
process returns to step 506 where the search process is
performed for the next fixed length chain by using the
character chain file.

In step 514, a variable length chain is searched for in the
extended character chain file and the extended position
information file. In this case, when variable length chains
having excess characters before or after the matched
character string are eliminated, it is possible to avoid
noise such as "cat" matching "communication". Specifically,
in a case where there are three or more characters before or
four or more characters after a matched character string, it
may be possible to eliminate such a chain, to subtract a
predetermined value from the similarity factor as a penalty
for each character existing before or after the matched
character string, or to multiply the similarity factor by a
predetermined value (a positive value less than one).
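The three ways of handling excess characters can be sketched in Python as follows. This is one possible reading only: the function name `handle_excess`, the `mode` switch, and the default penalty and factor values are hypothetical, and the thresholds of three characters before and four characters after are taken from the text.

```python
def handle_excess(similarity, chars_before, chars_after,
                  mode="eliminate", penalty=0.1, factor=0.5):
    """Adjust a similarity factor for excess characters around a match.

    Returns None when the hit is eliminated. The penalty and factor
    defaults are placeholders chosen for this sketch.
    """
    if chars_before < 3 and chars_after < 4:
        return similarity  # not enough excess characters to act on
    if mode == "eliminate":
        return None
    if mode == "subtract":
        # Fixed penalty for each character before or after the match
        return similarity - penalty * (chars_before + chars_after)
    if mode == "multiply":
        return similarity * factor
    raise ValueError(f"unknown mode: {mode}")


# "cat" inside "communication": 7 characters before, 3 after
noise = handle_excess(1.0, 7, 3)
```

In the "communication" case the seven leading characters exceed the threshold, so the default mode drops the hit.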

In step 516, if it is determined that no variable length
chain is found, a message box is presented in step 526
indicating that the search character string is not found, and
the process ends. Although, in the preferred embodiment of
the present invention, the message is displayed on the
display device 110 of FIG. 1, it may be possible to transfer
the message to another location through a network. On the
other hand, if it is determined in step 516 that a variable
length chain is found, the extended position information file
308 returns one or more variable length chain numbers 316.
Based on this information, the position information file 304
returns, in step 518, one or more document numbers 312 and
at least one position number in document 314 for each of
these document numbers, which are then temporarily stored in
a predetermined buffer region in the main memory or on the
disk for subsequent processing.

In step 520, it is determined whether all variable length 
chains created from the search character string have been 
searched. If so, the process proceeds to step 522. If not, the 
process returns to step 514 where the search is performed 
with the next variable length chain by using the extended 
character chain file, and the extended position information 
file. 

In step 522, the position information for the fixed length
chains stored in the buffer in step 510 and the position
information for the variable length chains stored in the
buffer in step 518 are checked, and the document numbers
containing character strings matching the search character
string, together with their position numbers, are stored in
the buffer region. If it is determined in step 524 that the
search character string is found, the contents of the
documents in the database 202 are accessed in step 528 by
using these document numbers and position numbers in the
document, and the applicable lines of the documents in which
the search character string exists are preferably displayed
in individual windows. If it is determined in step 524 that
the search character string is not found, a message box
indicating that the search character string is not found is
preferably displayed in step 526, and the process ends.
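The position check of step 522 can be sketched in Python for the fixed length chain part of a query. The sketch assumes that a match requires the i-th chain of the query to appear exactly i positions after the first one; the function name `strict_search` and the input layout are hypothetical, and variable length chains are left out for brevity.

```python
def strict_search(query_chains, postings):
    """Find (document number, start position) pairs matching every chain.

    query_chains: list of (chain, offset of the chain within the query)
    postings:     {chain: set of (document number, position number)}
    """
    candidates = None
    for chain, offset in query_chains:
        # Shift each hit back to where the query would have to start
        starts = {(doc, pos - offset) for doc, pos in postings.get(chain, ())}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return []  # some chain is missing or misaligned
    return sorted(candidates or [])


postings = {"-1": {(1, 15)}, "12": {(1, 16)}, "23": {(1, 17)}}
found = strict_search([("-1", 0), ("12", 1), ("23", 2)], postings)
```

Intersecting the shifted positions keeps only the documents in which the chains occur consecutively, which is what distinguishes a strict match from a mere co-occurrence of chains.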

In order to check that the search character string appears 
in a specific block in the document (for example, the third 
block), it is sufficient to count delimiter positions in the 
document which appear until reaching the position where 
the search character string appears in the document, to check 
at which block (x-th block) the search character string is 
positioned in the document, and to compare it with the 
specified block number. 
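
The block check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the choice of a newline as the block delimiter are assumptions made only for the example.

```python
def block_number(text: str, position: int, delimiter: str = "\n") -> int:
    """Return the 1-based block that contains the 0-based character position.

    The block number is one more than the count of delimiters that appear
    before the position (the delimiter itself is a hypothetical choice).
    """
    if not 0 <= position < len(text):
        raise ValueError("position outside document")
    return text.count(delimiter, 0, position) + 1


def appears_in_block(text: str, needle: str, wanted_block: int,
                     delimiter: str = "\n") -> bool:
    """True if any occurrence of needle starts inside the wanted block."""
    start = text.find(needle)
    while start != -1:
        if block_number(text, start, delimiter) == wanted_block:
            return True
        start = text.find(needle, start + 1)
    return False
```

For a three-block document, `appears_in_block(doc, "keyword", 3)` would be true only when an occurrence of "keyword" starts after the second delimiter.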
10. Ambiguity Search Process

The process shown in FIG. 7 is to perform a so-called 
"strict search" by using the index file. However, according 
to the present invention, it is also possible to prepare a 
so-called "ambiguity search" process for character strings 
including those similar to a specified character string in 
individual documents in the database at a high speed by 
using an index file. Specifically, the ambiguity search 
scheme operates on a character string to be searched, and 
search accuracy (larger than zero but less than or equal to 
one) to identify documents including "similar character 
strings" of which the "similarity factor" with the character 
string to be searched is higher than the specified search
accuracy and positions in the document of the "similar 
character strings". 
11. Using a Human-like Thought Process to Determine that
Character Strings are Similar 

When English is a delimiter language, English character
strings in which the arrangement of characters resembles
each other include:

(1) Different expression
"database", "data base", "data-base"

(2) Inflection
"communicate", "communication"

(3) Typographical error
In case of "communication":
"communication": missing character
"communication": reversal of characters
"communication": excess character

(4) Hyphenation
"communication", "communi-cation"

(5) Variation of phrase
"new technology", "new CMOS technology"

In the above examples, most of the characters continuously
match, but there is a missing character or an excess
character or some other minor variation. Similar examples
may be found in Japanese, Korean, and Chinese.

12. Rule for Determining Similar Character Strings and 
Similarity Factor 

First, description is provided for a case of a search
character string consisting of only the fixed length chain,
according to a rule for determining similar character strings
and a similarity factor. It is a general rule to collect character
strings which have the same or a similar sequential relationship
of characters as an input character string. To some
extent, the selected character strings continuously match
the input character string over M characters or more.
From this, it is possible to determine a similarity factor from
the number of matched characters and the number of
non-matched characters.

Terms used in this description are defined as follows.

Matched character string:
A section in which a character string to be searched
continuously matches the document text over M characters
or more. The longest character string is selected from those
starting from the same character.

Example:
Character string to be searched: communication
Document text: . . . the communit . . .

If M=2, "communi" is the matched character string. In
this case, because of the longest selection, "com" or
"commu" is not referred to as a matched character string. In
addition, "t" is also not a matched character string because
it is less than two characters.

Valid matched character string:
It is a matched character string of M characters constituting
a similar character string. A valid matched character
string in a search character string is called a valid matched
search character string; a valid matched character string in a
document is called a valid matched document character
string. Since the valid matched search character string and
the valid matched document character string match in their
contents, they are simply called a valid matched character
string unless distinction is required.

Longest non-matched character string length L:
A run of non-matched characters contained in a similar
character string should be at most L continuous characters.
L is a constant of one or more.
Now, description is given on how to select a "similar
character string" and how to digitize a "similarity factor".

(1) Determination of first valid matched character string
The first matched character string in a document is to be
the first valid matched character string, where it is expressed
that
s (D, i) is the start position of the i-th valid matched
document character string;
e (D, i) is the end position of the i-th valid matched
document character string;
s (C, i) is the start position of the i-th valid matched search
character string; and
e (C, i) is the end position of the i-th valid matched search
character string.

(2) Determination of next valid matched character string
When the i-th valid matched document character string
has been determined, the (i+1)-th valid matched document
character string is determined in the following manner.
The first matched character string satisfying the following
two conditions a) and b) is made the (i+1)-th valid matched
character string.

EQUATION 7

a) e (D, i)+1 ≤ s (D, i+1) ≤ e (D, i)+L+1

This means that up to L characters are allowed as excess
characters which may exist between the i-th valid matched
document character string and the (i+1)-th valid matched
document character string.
(Refer to Example 2 described later.)

EQUATION 8

b) s (C, i+1) > e (C, i)-(M-1)

The process is continued until it is determined that such a
valid matched document character string cannot be selected.

(3) Determination of "similar character string" and its
"similarity factor" (degree of similarity)
When the valid matched document character string cannot
be selected any more, a "similarity factor" is calculated from
the following equation by assuming that a "similar character
string" is from the first character of the first valid matched
character string to the last character of the last valid matched
character string.

EQUATION 9

Similarity factor =
minimum (number of characters in a character string to be
searched belonging to a valid matched search character
string / number of characters of the character string to be
searched,
number of characters in a "similar character string"
belonging to the valid matched document character
string / number of characters of the "similar character
string")

The similarity factor can also be calculated from the number of
characters not belonging to a valid matched document
character string.

EQUATION 10

Similarity factor = 1 -
maximum (number of characters in a character string to
be searched not belonging to a valid matched search
character string / number of characters of the character
string to be searched,
number of characters in a "similar character string" not
belonging to the valid matched document character
string / number of characters of the "similar character
string")

13. How to Count the Number of Characters in "Similar
Character String" Belonging to Valid Matched Character
String

When characters correspond to the same characters in
the character string to be searched, the first character is
counted as 1, and the second one is counted as 0.5.
Otherwise, one character is counted as 1. (Refer to Example
3 described later.)

14. Order of Determination of "Similar Character String"

The first "similar character string" is determined by
starting comparisons from the top of document. When the
i-th "similar character string" has been determined, the
(i+1)-th "similar character string" is determined by starting
comparisons from the first character which is behind a
character at the top of the i-th "similar character string", and
does not belong to a valid matched character string constituting
the i-th "similar character string".

A "similarity factor" similar to general human thought
determinations of similarity can be calculated on whether or
not the arrangement of characters resemble each other by
setting the constants L and M to suitable values. When the
"similarity factor" becomes the maximum value of 1, the
character strings completely match. When the character
strings completely match, the "similarity factor" always
becomes 1.

15. Process Flowchart for an Ambiguity Search

The above process is represented as a flowchart shown in
FIG. 8. Referring to FIG. 8, in step 602, input of a search
character string is prompted. In step 604, input of similarity
factors of 0-1 is prompted. Usually, input of the character
string and the value in steps 602 and 604 are performed by
using an input box and a scroll bar of a single dialogue box.

In step 606, the number i for a valid matched character
string is set to 1. In step 608, the valid matched character
string is searched. Now, if there is a condition that the length
of valid matched character string is M characters or more, it
is advantageous that an index file of M-character chains is
created in the process of FIG. 7. This is because, if such
index file previously exists, search for any M-character
chain can be performed at a high speed by the dichotomized
searching of the index file. Subsequently, search for an
M-character chain is performed in the index file by shifting
the start position for taking the M-character chain in the
index character string by one. Then, if the resulting document
number is the same as one previous search for the
M-character chain and the position number in the document
is sequential, a valid matched character string with a length
of M+1 would be obtained. Thus, whenever the condition
that the document number is the same as one previous search
for the M-character chain and that the position number in the
document is sequential are satisfied, the length of valid
matched character string is incremented by one. However, if
nothing is found in the search for M-character chain using
the index file, or the document number being returned
does not match, or the position number in document
becomes non-sequential, the end position of valid matched
character string would be found.

Sometimes, no valid matched character string is found. In
such a case, depending on the decision in step 610, the
process proceeds to step 626, where it indicates that nothing
is found and ends. When it is determined in step 610 that a
valid matched character string is found, the process proceeds
to step 612, and s (D, i) to e (D, i) in the document and s (C,
i) to e (C, i) in the search character string are marked as the
valid matched character string.
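
The way step 608 grows a valid matched character string from an index of M-character chains can be sketched as below. This is only an illustration under stated assumptions: the in-memory sorted list, the tuple layout `(chain, document number, position)` and the function names are hypothetical stand-ins for the index file, and Python's `bisect` is used here as the dichotomized (binary) search.

```python
import bisect


def build_index(docs, m):
    """Sorted list of (chain, doc_no, pos) for every m-character chain.

    Document numbers and positions are 1-based, as in the description.
    """
    entries = []
    for doc_no, text in enumerate(docs, start=1):
        for pos in range(len(text) - m + 1):
            entries.append((text[pos:pos + m], doc_no, pos + 1))
    entries.sort()
    return entries


def lookup(index, chain):
    """Binary-search the sorted index; return {(doc_no, pos)} for the chain."""
    i = bisect.bisect_left(index, (chain,))
    hits = set()
    while i < len(index) and index[i][0] == chain:
        hits.add(index[i][1:])
        i += 1
    return hits


def match_length(index, search, start, doc_no, pos, m):
    """Length of the match of search[start:] at (doc_no, pos).

    The start position in the search string is shifted by one per step; the
    length grows by one while the same document number is returned with a
    sequential position number, and stops otherwise.
    """
    if (doc_no, pos) not in lookup(index, search[start:start + m]):
        return 0
    length, shift = m, 1
    while (start + shift + m <= len(search)
           and (doc_no, pos + shift)
           in lookup(index, search[start + shift:start + shift + m])):
        length += 1
        shift += 1
    return length
```

With M=2 and the document text "the communit", the search string "communication" yields a match of length 7 ("communi") starting at position 5.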

In step 614, the (i+1)-th valid matched character string
satisfying the conditions

a) e (D, i)+1 ≤ s (D, i+1) ≤ e (D, i)+L+1,

and

b) s (C, i+1) > e (C, i)-(M-1)

is searched using the index file. If found, the process returns 
to step 612 where, for the (i+l)-th valid matched character 
string, s (D, i+1) to e (D, i+1) in the document and s (C, i+1) 
to e (C, i+1) in the search character string are marked as a 
valid character string (increment of i is indicated in step 
618). 

On the other hand, if a valid matched character string is 
not found in step 616, a similarity factor is calculated in step 
620. The similarity factor is determined by the exemplary 
condition. 

Similarity factor= 

minimum (number of characters in a character string to be 
searched belonging to a valid matched character string/ 
number of characters of the character string to be 
searched, 

number of characters in a "similar character string"
belonging to the valid matched character string/number 
of characters of the "similar character string") 

In this case, the "similar character string" is a character 
string from the start position of the first valid matched 
character string in the document to the last position of the 
last valid matched character string. 

In step 622, results are selected from the similarity factor 
calculated in step 620 and that input in step 604. Only results 
with the similarity factor equal to or higher than that input 
in step 604 are displayed in step 624. 

In step 624, a process is performed to access contents of 
the documents stored in the database based on the document 
numbers and the position numbers in the document returned 
as the result of searches of the index file in steps 608 and 
614, and to display lines containing applicable sections. 

Although the "similar character string" for one search 
character string may be simultaneously found in a plurality 
of documents, it may be found at a plurality of sections even 
in a single document. Accordingly, it should be noted that 
steps 606-622 are applied to each of such plurality of 
"similar character strings", and, in step 624, only those of 
the plurality of "similar character strings" satisfying the 
conditions for similarity factor are selected and displayed. 
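
The selection rules (1) to (3) and the counting rule of section 13 can be sketched as follows. This is a minimal reading of the rules, not the patented implementation: it scans the document text directly instead of using the index file, uses 0-based positions internally, and counts every repeated use of a search character as 0.5 (the text only states the weight of the second one). The names `Match`, `longest_match` and `similar_string` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    s_c: int  # start in search string (0-based, inclusive)
    e_c: int  # end in search string (inclusive)
    s_d: int  # start in document (inclusive)
    e_d: int  # end in document (inclusive)


def longest_match(search, doc, d_pos, c_min, m):
    """Longest run (>= m chars) of search, starting at index >= c_min,
    that matches the document beginning at d_pos; None if there is none."""
    best = None
    for c_pos in range(max(c_min, 0), len(search)):
        n = 0
        while (c_pos + n < len(search) and d_pos + n < len(doc)
               and search[c_pos + n] == doc[d_pos + n]):
            n += 1
        if n >= m and (best is None or n > best[1]):
            best = (c_pos, n)
    if best is None:
        return None
    c_pos, n = best
    return Match(c_pos, c_pos + n - 1, d_pos, d_pos + n - 1)


def similar_string(search, doc, m=2, l=3, start=0):
    """Return (similar character string, similarity factor) or None."""
    matches = []
    # Rule (1): the first matched character string in the document.
    for d in range(start, len(doc)):
        hit = longest_match(search, doc, d, 0, m)
        if hit:
            matches.append(hit)
            break
    if not matches:
        return None
    # Rule (2): conditions a) and b) for each following valid match.
    while True:
        last = matches[-1]
        nxt = None
        last_d = min(last.e_d + l + 1, len(doc) - 1)
        for d in range(last.e_d + 1, last_d + 1):            # condition a)
            c_min = last.e_c - (m - 1) + 1                    # condition b)
            nxt = longest_match(search, doc, d, c_min, m)
            if nxt:
                break
        if nxt is None:
            break
        matches.append(nxt)
    # Rule (3): similarity factor as the minimum of the two ratios.
    covered = set()
    for mt in matches:
        covered.update(range(mt.s_c, mt.e_c + 1))
    seen, weight = set(), 0.0
    for mt in matches:
        for c in range(mt.s_c, mt.e_c + 1):
            weight += 0.5 if c in seen else 1.0   # section 13 weighting
            seen.add(c)
    span = matches[-1].e_d - matches[0].s_d + 1
    factor = min(len(covered) / len(search), weight / span)
    return doc[matches[0].s_d:matches[-1].e_d + 1], factor
```

With M=2 and L=3 this sketch gives 0.75 for "ABCDEF" against "AB.CD.EF", 0.5 for "ABCD" against "ABXXXXCDXXXXXX.", and 0.875 for "ABC" against "XABBCXXXX".
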
16. Examples of Determination on "Similar Character 
String" and Similarity Factor 

Examples are given with M=2 and L=3.

EXAMPLE 1

                                    123456
Character string to be searched C:  ABCDEF

             12345678 . . .
Document D:  AB.CD.EF . . .

Since the longest character string first matched is "AB", the
first valid matched character string is "AB":

s (C, 1)=1   e (C, 1)=2
s (D, 1)=1   e (D, 1)=2

Since e (C, 1)-(M-1)=1, the second valid matched character
string is searched by comparing a character string
starting at or after the second character in the character string
to be searched with a character string starting at the third,
fourth, fifth or sixth character in the document (because
e (D, 1)+1=3, and e (D, 1)+L+1=6).

Second valid matched character string is "CD":

s (C, 2)=3   e (C, 2)=4
s (D, 2)=4   e (D, 2)=5

Since e (C, 2)-(M-1)=3, the third valid matched character
string is searched by comparing a character string starting at
or after the fourth character in the character string to be
searched with a character string starting at the sixth,
seventh, eighth or ninth character in the document (because
e (D, 2)+1=6, and e (D, 2)+L+1=9).

Third valid matched character string is "EF":

s (C, 3)=5   e (C, 3)=6
s (D, 3)=7   e (D, 3)=8

Since the end of the character string to be searched is
reached, the third is the last valid matched character string.

TABLE 4

AB CD EF
1  2  3
AB . CD . EF . . .
1    2    3

Numerals are the numbers of the valid matched character
strings.

Therefore, the "similar character string" is
"AB.CD.EF" from s (D, 1) to e (D, 3).

"Similarity factor" = minimum (6/6, 6/8) = 6/8 = 0.75

EXAMPLE 2

TABLE 5

                                    1234
Character string to be searched C:  ABCD

TABLE 6

Position:    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Document D:  A B X X X X C D X X  X  X  X  X  .

Since the longest character string first matched is "AB",
the first valid matched character string is "AB":

s (C, 1)=1   e (C, 1)=2
s (D, 1)=1   e (D, 1)=2

The second valid matched character string is searched by
comparing a character string starting at the third, fourth, fifth
or sixth character in the document (because e (D, 1)+1=3,
and e (D, 1)+L+1=6) with a character string starting at or
after the second character in the character string to be
searched (because e (C, 1)-(M-1)=1).

Since the second valid matched character string is not
found, and the end of the character string to be searched is
reached, the valid matched character string is only the first
one.

TABLE 7

ABCD
1
ABXXXXCDXXXXXX.
1

Therefore, the first "similar character string" is "AB" from
s (D, 1) to e (D, 1).

"Similarity factor" = minimum (2/4, 2/2) = 2/4 = 0.5

The first non-valid matched character after "A" is "X".
When the second "similar character string" after "X" is
searched:

TABLE 8

ABCD
  1
ABXXXXCDXXXXXX.
      1

However, since "AB" and "CD" are separated by four
characters in the document, and L=3 in this example, the
above "CD" is not considered as a valid matched character
string.

EXAMPLE 3

TABLE 9

                                    123
Character string to be searched C:  ABC

             123456789
Document D:  XABBCXXXX

Since the longest character string first matched is "AB":

EQUATION 11

First valid matched character string is "AB"
s (C, 1)=1   e (C, 1)=2
s (D, 1)=2   e (D, 1)=3

The second valid matched character string is searched by
comparing a character string starting at or after the second
character in the character string to be searched (because
e (C, 1)-(M-1)=1) with a character string starting at the
fourth, fifth, sixth or seventh character in the document
(because e (D, 1)+1=4, and e (D, 1)+L+1=7).

Second valid matched character string is "BC"
s (C, 2)=2   e (C, 2)=3
s (D, 2)=4   e (D, 2)=5

Since the end of the character string to be searched is
reached, there are two valid matched character strings.

TABLE 10

ABC
1
 2
XABBCXXXX
 1 2

1, 1, 0.5, 1 -> 3.5

The "similar character string" is "ABBC" from s (D, 1) to
e (D, 2).

"Similarity factor" = minimum (3/3, 3.5/4) = 3.5/4 = 0.875

18. Search of Variable Length Chain

The method for searching a similar character string is
described in the above for a case of a search character string
consisting of only fixed length chains. When this is extended
to a search character string containing variable length
chains, the search proceeds as follows. Here, a variable
length chain taken out from the search character string is
called an extended search character string.

First, the following extended character chain is obtained
by searching an extended character chain file and an
extended position information file. The searching method is
the same as the method for searching a search character
string consisting of only fixed length chains from a character
chain file and a position information file. M' is used as the
constant corresponding to M.

(1) Searching a variable length chain matching an extended
search character string with a specified search matching
factor or higher. In this case, eliminating variable length
chains not matching the first character in the extended search
character string is effective to reduce noise. In this case, in
creating an extended fixed chain, high speed processing can
be performed by using a symbol such as "$" indicative of a
start, creating an extended fixed chain of "$co", and eliminating
variable length chains not matching it. However, even
when "$" indicative of start is not used, it is possible to
identify a start position of the variable length chain from a
position number in variable length chain or information on
the delimiter in the extended position information file.

EXAMPLE 4

Extended search character string: "communication"
Found variable length chain: "communication"

(2) Searching a variable length chain matching an extended
search character string which is newly created by joining
extended search character strings with a specified search
matching factor or higher

EXAMPLE 5

Search character string: "data-base"
Extended search character string: "data", "base"
Extended search character string created by joint: "database"

It may be possible to set the number of joints in a string
up to two or three. This process enables the search to locate
the joined "database" from character strings in a document
even in a case of a divided search character string such as
"data base".

(3) Searching variable length chains satisfying all the following
conditions from variable length chains matching
extended search character string with a matching factor
larger than 0.

The first character in variable length chain is included in
the matched section.

TABLE 11

[Example 6] (underline under matched section)

Extended search character string    Result    Variable length chain
"database"                          -> o      "data"
"database"                          -> x      "update"
"database"                          -> o

The first or last character in extended search character
string is included in the matched section.
In this case, the process can be performed at a high speed by
creating an extended fixed chain of "$co" and "se¥" through
use of "$" indicative of a start and "¥" indicative of an end
in creating the extended fixed chain, and by eliminating
variable length chains not matching either of them.

TABLE 12

[Example 7] (underline under matched section)

Extended search character string    Result    Variable length chain
"database"                          -> o      "data"
"database"                          -> o
"database"                          -> x      "tab."

This process enables the location of the divided "data"
and "base" from character strings in a document even in a
case of a joined search character string such as "database".
When a variable length chain contains non-matched characters
in the number equal to or more than a predetermined
number of characters, such as when the first character in the
extended search character string is included in the matched
section (the variable length chain "data" for the extended
search character string "database" being an example), or
when the last character in the extended search character
string is included in the matched section (the variable length
chain "base" for the extended search character string
"database" being an example), such variable length chains may be
excluded from the subject for search. This enables the
method to increase the search speed by narrowing down the
subjects for search.

In the preferred embodiment of the present invention, the
variable length chain subject to the processes (1), (2) and (3)
described above (that is, (1)+(2)+(3)) becomes the "variable
length chain satisfying the conditions" in step 708 (FIG. 9).
However, the conditions may be variously changed to be (1)
only, (3) only, (1)+(2) or the like through setting instead of
the conditions (1)+(2)+(3).

In determining the matching factor for searching an
extended index by looking for the variable length chains of
(1), (2) and (3), if an evaluation lower than normal matching
but higher than normal non-matching is given due to reversal
of characters, it is effective to search a word in which
position of characters is reversed, which is often found in a
typographical error of an English word.

TABLE 13

[Example of reversed characters]

Search character string:    "communication"
Document character string:  "commnuication"

In this example, a non-matched character string in the
search character string is "un", while a non-matched character
string in the document character string is "nu". In such
a case, such non-matching caused by typographical error can
be detected by determining, for example, whether or not "u"
and "n" of the non-matched character string in the search
character string are contained in the non-matched character
string in the document character string.
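
The reversal check described above can be sketched as follows. This is an illustrative sketch under assumptions: the non-matched sections are found here by stripping the common prefix and suffix of the two strings, and the function names are hypothetical. The misspelling "commnuication" is the one implied by the non-matched sections "un" and "nu".

```python
from collections import Counter


def unmatched_gap(search: str, doc: str):
    """Strip the common prefix and suffix; return the two non-matched parts."""
    limit = min(len(search), len(doc))
    p = 0
    while p < limit and search[p] == doc[p]:
        p += 1
    s = 0
    while s < limit - p and search[-1 - s] == doc[-1 - s]:
        s += 1
    return search[p:len(search) - s], doc[p:len(doc) - s]


def looks_like_reversal(search_gap: str, doc_gap: str) -> bool:
    """True if the gaps differ but every non-matched search character
    is contained in the non-matched document section."""
    if not search_gap or search_gap == doc_gap:
        return False
    # Counter subtraction keeps only characters missing from doc_gap.
    return not (Counter(search_gap) - Counter(doc_gap))
```

A match flagged this way could then be given an evaluation between normal matching and normal non-matching, as the text suggests.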

The search method for a search character string containing
such a variable length chain is performed in the procedure
shown in FIG. 9. This procedure is described with
reference to an embodiment of the invention. In this
embodiment, the search characters are "data communication",
and "data . . . communication", "data comminucation", and
"daily communication" exist in a document character string.
It is assumed that "data" and "communication" in
"data . . . communication" are sufficiently separated.

Information on a character chain file and on a position
information file are assumed to be as follows (M'=3):

Character chain file    Position information file
1. data                 1-1, 2-1
2. daily                3-1
3. comminucation        2-6
4. communication        1-35, 3-7

Extended character      Extended position
chain file              information file
$da                     1-1, 2-1
dat                     1-2
ta¥                     1-3
a¥                      1-4
dai                     2-2
ail                     2-3
ily                     2-4
ly¥                     2-5
y¥                      2-6
$co                     3-1, 4-1
com                     3-2, 4-2
omm                     3-3, 4-3
mmi                     3-4
min                     3-5
inu                     3-6
nuc                     3-7
uca                     3-8
cat                     3-9, 4-9
ati                     3-10, 4-10
tio                     3-11, 4-11
ion                     3-12, 4-12
on¥                     3-13, 4-13
n¥                      3-14, 4-14
mmu                     4-4
mun                     4-5
uni                     4-6
nic                     4-7
ica                     4-8


Here, for the purpose of easy understanding, an unsorted 
file is shown. Here, in "1. data 1-1, 2-1", "1." indicates the 
variable length chain number, "data" indicates a variable 
length chain, and "1-1" and "2-1" indicate document 
number-position number in the document. In addition, in 
"$da 1-1, 2-1", "$da" indicates an extended character chain, 
and "1-1" and "2-1" indicate variable length chain number- 
position in the variable length chain. Accordingly, "$da 1-1, 
2-1" represents the first character in a variable length chain 
number 1 (data) and the first character in variable length 
chain number 2 (daily). 
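
How such an extended character chain file might be generated can be sketched as below. This is an assumption-laden illustration: the functions are hypothetical, the position numbers simply count the windows taken over the chain padded with "$" and "¥", and the final window is left shorter than M', which is how entries such as "y¥" and "n¥" appear to arise. The sketch emits every window, so it may list a chain that a printed table omits.

```python
from collections import defaultdict


def extended_chains(chain: str, m: int = 3, start: str = "$", end: str = "¥"):
    """Return [(extended chain, position)] for one variable length chain.

    Windows of m characters are taken over start + chain + end; the last
    window is truncated to m - 1 characters.
    """
    padded = start + chain + end
    return [(padded[i:i + m], i + 1) for i in range(len(padded) - 1)]


def build_extended_index(chains, m: int = 3):
    """Map each extended chain to [(variable length chain number, position)]."""
    index = defaultdict(list)
    for number, chain in enumerate(chains, start=1):
        for gram, pos in extended_chains(chain, m):
            index[gram].append((number, pos))
    return dict(index)
```

For the four chains "data", "daily", "comminucation" and "communication", the entry for "$co" is `[(3, 1), (4, 1)]`, which corresponds to "$co 3-1, 4-1".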

When the procedure of FIG. 9 is started, a search char- 
acter string "data communication" is input (step 702). Then, 
a similarity factor is input (step 704). It may be possible to 
set this similarity factor to a default, and to omit its input. In 
this example, a similarity factor of 0.80 is used.

Then, a fixed length chain and a variable length search
character string are created from the search character string
(step 706). In this example, since the description is made on
a document consisting of only a delimiter language, there is
no fixed length chain.

Variable length chains satisfying the conditions are 
searched from the extended character chain file and the 
extended position information file (step 708). Here, a pro- 
cess is performed for searching variable length chains 
according to the processes (1), (2) and (3). That is, (1) search 
is performed for a variable length chain matching an 
extended search character string with a specified search 
matching factor or higher. While the search matching factor 
in this case may be set to the same value as the similarity 
factor for the entire character string, it is preferable to be 
lower than the similarity factor for the entire document. 

For example, since "communication" and "comminucation"
match for 10 characters in 13 characters, and three
characters do not match, the similarity factor is 10/13=0.77
using a simple calculation method for the similarity factor.
(Here, for easy understanding by the reader, description is
made by using the simple calculation method for similarity
factor. However, other calculations may also be used.) The
similarity factor between "data communication" and "data
comminucation" is 15/18=0.83 in the simple calculation
method of similarity because they match for 15 characters in
18 characters including the delimiter, and three characters do
not match. Thus, there are many cases where the similarity
factor for the entire character string becomes higher even if the
similarity factor between the variable length chains is low,
and this tendency becomes more significant when the character
string becomes longer.
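
The arithmetic above can be reproduced with a small sketch. The "simple calculation method" is not spelled out in the text, so this is one possible reading and an assumption: the two strings are compared position by position, and only runs of at least two equal characters count as matched. The misspelled chain "comminucation" is the one implied by the extended character chains "mmi", "min", "inu", "nuc" and "uca".

```python
def simple_similarity(a: str, b: str, min_run: int = 2) -> float:
    """Share of characters lying in position-wise equal runs >= min_run."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares strings of equal length")
    matched = run = 0
    for x, y in zip(a, b):
        if x == y:
            run += 1
        else:
            if run >= min_run:
                matched += run
            run = 0
    if run >= min_run:
        matched += run
    return matched / len(a)
```

Under this reading, the single chain scores 10/13 while the whole search string scores 15/18, which illustrates why the matching factor for a single chain is set lower than the similarity factor for the entire string.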

In this embodiment, the "specified search matching fac- 
tor" is set to 0.60 for "data" and 0.72 for "communication". 
The "specified search matching factor" may be changed 
depending on the ratio of the number of characters in the 
search character string to that in the variable length chain. 
This is because, as shown in this example, while the simi- 
larity factor for an entire character string does not become 
0.80 or higher unless the variable length chain matches 
"communication" for 10 characters or more, there is a 
possibility that, for "data", the similarity factor for entire 
character string becomes 0.80 or higher even if only one 
character matches. However, if too low a matching factor is 
allowed for "data", the number of matched variable length 
chains increases and the search speed is affected so that 0.6 
is set as the lower limit. In addition, in the preferred 
embodiment of the present invention, variable length chains 
not seriously affecting the similarity factor for the entire 
document are excluded from the subject for variable length 
chain searching of (1). This enables the present search 
method to improve the search speed. 

In the embodiment, a high speed search is made possible
by excluding variable length chains which have a number of
characters less than the similarity factor × number of characters
of variable length chain to be searched (=0.72×13=9.36)
from the subject for search. This is attained by controlling
the number of characters of the variable length chain as
follows:

Extended character      Extended position
chain file              information file
$da                     1-1-4, 2-1-5

"1-1-4" and "2-1-5" in "$da 1-1-4, 2-1-5" indicate "variable
length chain number-position in the variable length chain-
number of characters in the variable length chain".
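
The length filter can be sketched as follows, assuming each entry of the extended position information file is held as a tuple `(chain number, position, number of characters)`; the tuple layout and the function name are hypothetical.

```python
def prune_by_length(postings, search_len: int, factor: float):
    """Keep only entries whose chain has at least factor * search_len characters."""
    threshold = factor * search_len  # e.g. 0.72 * 13 = 9.36
    return [entry for entry in postings if entry[2] >= threshold]
```

For a 13-character search chain and a factor of 0.72, the entries for the 4-character "data" and the 5-character "daily" are dropped before any character comparison is made.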
As variable length chains having the search matching
factor of 0.60 or higher for "data" and 0.72 or higher for
"communication", "data" (100%) can be found for "data",
and "comminucation" (77%) and "communication" (100%)
can be found for "communication" by a method similar to
that for a fixed length chain.

In the case of a search character string "communication",
for example, chains matching a corresponding extended
search character string are the following in the extended
character chains:

Extended character      Extended position
chain file              information file
$co                     3-1, 4-1
com                     3-2, 4-2
omm                     3-3, 4-3
mmi                     3-4
min                     3-5
inu                     3-6
nuc                     3-7
uca                     3-8
cat                     3-9, 4-9
ati                     3-10, 4-10
tio                     3-11, 4-11
ion                     3-12, 4-12
on¥                     3-13, 4-13
n¥                      3-14, 4-14


Then, "3. comminucation" and "4. communication" in the
character chain file can be found from the information on the
extended position information file.

Then, (2) search is performed for a variable length chain
matching an extended search character string which is newly
created by joining extended search character strings with a
specified search matching factor or higher. Therefore, variable
length chains matching the search character string
"datacommunication" with a specified search matching factor
or higher are searched. In this case, while the "specified
search matching factor" may be set to the same value as the
similarity factor for the entire character string, it may be lower
than the similarity factor for the entire document. For
example, when the search character string is "data base
system", there exist three extended search character strings
created by the joining: "database", "basesystem" and
"databasesystem". Impact of these joined search character strings
on the similarity factor for the entire document varies
depending on the number of characters in the joined search
character strings. In the embodiment, because the "specified
search matching factor" of (2) is set to 0.80, there is no
variable length chain matching the search character string
"datacommunication" with the specified search matching
factor or higher.
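
The joined search character strings of process (2) can be enumerated with a short sketch. The function name, the `max_joins` parameter and the use of a single delimiter character are assumptions made for illustration.

```python
def joined_search_strings(search: str, delimiter: str = " ", max_joins=None):
    """Join every run of two or more adjacent extended search strings.

    max_joins optionally caps the number of joints per joined string.
    """
    parts = search.split(delimiter)
    joined = []
    for i in range(len(parts)):
        for j in range(i + 2, len(parts) + 1):
            if max_joins is None or (j - i - 1) <= max_joins:
                joined.append("".join(parts[i:j]))
    return joined
```

For "data base system" this yields "database", "databasesystem" and "basesystem"; limiting `max_joins` to 1 drops the two-joint string "databasesystem".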

Then, variable length chains satisfying the conditions of
"1. the first character of variable length chain is contained in
the matched section", and "2. the first or last character of
extended search character string is contained in the matched
section" are searched in (3) variable length chains matching
the extended search character string with the matching factor
of 0 or higher. The variable length chains meeting these
conditions are "data", "comminucation" and "communication".

Referring to FIG. 9 again, in step 710, it is determined
whether or not a variable length chain is found. In the
embodiment, "1. data", "3. communication" and
"4. communication" have been found. Numbers of these variable
length chains are stored in a buffer (step 712). In the
embodiment, the variable length chain numbers 1, 3, and 4
are stored. In the preferred embodiment of the present
invention, the variable length chain character strings "data"
and "communication" of the search character string "data
communication" are respectively assigned a number
(variable length chain search character string number). The
variable length chain numbers 1, 3 and 4 are stored in
connection with such numbers. Therefore, the information
being stored is (1-1) and (2-3, 4). Here, (2-3, 4) indicates the
variable length chain numbers 3 and 4 of the variable length
chains in the document matching the variable length chain
search character string number 2 with a certain matching
factor or higher.

Then, in step 714, the similarity factor for the entire 
character string is calculated from the positional relationship 
between the position information of the fixed length chain 
and the position information of the variable length chain. 
Specifically, as described above, the variable length chain 
numbers 1, 3 and 4 in the document are stored for the 
variable length chain search character string number.
Therefore, the variable length chains in the document which 
may be joined are 1-3 and 1-4 ("1." corresponds to "data", 
"3." to "communication", and "4." to "communication"). 

When the contents of the character chain file 302
and the position information file 304 are referenced,

Character chain file    Position information file
1. data                 1-1, 2-1
3. communication        2-6
4. communication        1-35, 3-7

combinations of (1-1, 2-1)-(2-6), and (1-1, 2-1)-(1-35, 3-7)
become candidates. More particularly, combinations of

(1-1)-(2-6),
(2-1)-(2-6),
(1-1)-(1-35),
(1-1)-(3-7),
(2-1)-(1-35), and
(2-1)-(3-7)

become candidates.

However, cases where (1) document numbers are
different, where (2) condition L=3 is not satisfied, and where
(3) position numbers in the document are reversed are
excluded from the candidates for the calculation of similarity
factor. The only combination of variable length chains
remaining after these exclusions is (2-1)-(2-6). Therefore, the
similarity factor is calculated for "data communication". The
conditions where "2. condition L=3 is not satisfied", and
where "3. position numbers in document are reversed" are
employed for a case where order is important as in "data
communication", but are not employed for a case where
order is not important, as in searching for character strings
which are keywords (variable length chains) extracted from
the abstract of a patent specification.
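
The three exclusion rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the condition on L is read here as "at most L characters lie between the end of the first chain and the start of the second", and the names `positions` and `joinable` are hypothetical.

```python
from itertools import product

L = 3  # assumed meaning: longest allowed run of characters between two chains

# (document number, position in document), as in the position information file
positions = {
    "1. data": [(1, 1), (2, 1)],
    "3. communication": [(2, 6)],
    "4. communication": [(1, 35), (3, 7)],
}
FIRST_LENGTH = len("data")


def joinable(first, second, first_length, max_gap):
    """Return True if `second` may follow `first` as one joined candidate."""
    (doc_a, pos_a), (doc_b, pos_b) = first, second
    if doc_a != doc_b:   # (1) document numbers are different
        return False
    if pos_b <= pos_a:   # (3) position numbers are reversed
        return False
    gap = pos_b - (pos_a + first_length)  # characters between the two chains
    return 0 <= gap <= max_gap            # (2) condition on L (assumed reading)


second_chains = positions["3. communication"] + positions["4. communication"]
candidates = [
    (a, b)
    for a, b in product(positions["1. data"], second_chains)
    if joinable(a, b, FIRST_LENGTH, L)
]
print(candidates)  # [((2, 1), (2, 6))]
```

Of the six combinations, only (2-1)-(2-6) survives, which agrees with the result stated above.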

In the above description, in step 712, the variable length 
chain numbers are stored in connection with the variable 
length chain search character string numbers. However,
storing the variable length chain search character string 
numbers is not an essential component of the present inven- 
tion. Combinations of variable length chains can be deter- 
mined without information on the variable length chain 
search character string numbers. 

This is specifically described. The contents of the char- 
acter chain file 302 and the position information file 304 are 
referenced from the stored variable length chain numbers in 
step 712. 



Character chain file    Position information file
1. data                 1-1, 2-1
3. communication        2-6
4. communication        1-35, 3-7

The character chain file is divided according to the contents
of the position information file.



Character chain file    Position information file
1. data                 1-1
1. data                 2-1
3. communication        2-6
4. communication        1-35
4. communication        3-7

Then, the character chain file is sorted according to the
contents of the position information file.


Character chain file    Position information file
1. data                 1-1
4. communication        1-35
1. data                 2-1
3. communication        2-6
4. communication        3-7



If L=3, it is found that, in view of the contents of the position
information file,

1. data 1-1,
4. communication 1-35, and
4. communication 3-7

have no other variable length chains to be combined.
On the other hand, because

1. data 2-1, and
3. communication 2-6

satisfy the condition L=3, they are combined and the
similarity factor is determined for them.
Therefore, the candidates for the calculation of similarity
factor are:

1. data 1-1,
4. communication 1-35,
4. communication 3-7, and
1-3. data communication 2-1

To prevent duplicated calculation of similarity factor, they
are arranged by the variable length chain number.

1. data 1-1
4. communication 1-35, 3-7
1-3. data communication 2-1

Then, character strings shorter than the length of the search
character string (18, including the delimiter) × similarity
factor (0.80), that is, character strings shorter than 14.4
characters, are excluded from the calculation of similarity
factor. (In practice, it is desirable to perform this exclusion
before they are arranged by the variable length chain number.)
Accordingly, the only candidate for the calculation of
similarity factor is

1-3. data communication 2-1.
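
The divide, sort and combine procedure described above can be sketched as follows. The sketch is illustrative only: it combines at most two neighbouring chains, it again assumes that the condition on L limits the number of characters between two chains, and the names `entries`, `rows`, `candidates` and `survivors` are hypothetical.

```python
L = 3  # assumed: maximum number of characters allowed between two chains

# chain number -> (chain text, [(document number, position), ...])
entries = {
    1: ("data", [(1, 1), (2, 1)]),
    3: ("communication", [(2, 6)]),
    4: ("communication", [(1, 35), (3, 7)]),
}

# Divide into one row per occurrence, then sort by (document, position).
rows = sorted(
    (doc, pos, number, len(text))
    for number, (text, occurrences) in entries.items()
    for doc, pos in occurrences
)

# Combine a row with its successor when both lie in the same document
# and the gap between them is within L; otherwise keep the row alone.
candidates = []  # (label, document, position, length in characters)
i = 0
while i < len(rows):
    doc, pos, number, length = rows[i]
    if i + 1 < len(rows):
        next_doc, next_pos, next_number, next_length = rows[i + 1]
        gap = next_pos - (pos + length)
        if next_doc == doc and 0 <= gap <= L:
            span = next_pos + next_length - pos
            candidates.append((f"{number}-{next_number}", doc, pos, span))
            i += 2
            continue
    candidates.append((str(number), doc, pos, length))
    i += 1

# Exclude candidates shorter than search length x similarity factor.
search_length = len("data communication")  # 18, including the delimiter
threshold = search_length * 0.80           # 14.4
survivors = [c for c in candidates if c[3] >= threshold]

print([c[0] for c in candidates])  # ['1', '4', '1-3', '4']
print(survivors)                   # [('1-3', 2, 1, 18)]
```

Sorting first means that only neighbouring rows need to be compared, so the chain numbers of the search character string are not needed to find the combination.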

Since the search character string "data communication"
matches "data communication" for 15 characters of 18
characters, and three characters do not match, the similarity
factor between them is calculated as 0.83.

The calculation of similarity factor for character string in
step 714 can be reduced by storing the similarity factor of
variable length chain together with the variable length chain
numbers in step 712. That is, (1-1.00), (3-0.77) and (4-1.00)
are stored in step 712 and utilized.

In the simple calculation method, the similarity factor
can be calculated from the following equation.

Similarity factor = (number of characters in variable length
chain 1 × matching factor of variable length chain 1 + number
of characters in variable length chain 2 × matching factor of
variable length chain 2 + . . . + number of characters of
delimiter) / (number of characters in search character string)

Accordingly, the similarity factor of the embodiment is
(4 × 1.00 + 13 × 0.77 + 1)/18 = 0.83. Whether or not the delimiter
is counted as one character may be changed by design.
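
As a check on the arithmetic, the simple equation can be written as a short function. The function name `simple_similarity` and its parameters are hypothetical and serve only to illustrate the equation.

```python
def simple_similarity(chains, search_length, delimiter_chars=0):
    """chains: iterable of (number of characters, matching factor) pairs."""
    matched = sum(count * factor for count, factor in chains)
    return (matched + delimiter_chars) / search_length


# "data" (4 characters, factor 1.00), chain 3 (13 characters, factor 0.77),
# one delimiter, search character string of 18 characters.
value = simple_similarity([(4, 1.00), (13, 0.77)], search_length=18, delimiter_chars=1)
print(round(value, 2))  # 0.83
```

Setting `delimiter_chars=0` corresponds to the design choice of not counting the delimiter as a character.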

Then, in step 716, the input similarity factor is compared
to the calculated similarity factor. If there exists no character
string with a similarity factor higher than the input similarity
factor, a message indicating that none is found is displayed on
the display 110 (step 718). If one is found, the applicable
line(s) in the applicable document(s) are displayed (step 720).
However, the display indicating that none is found and the
display of applicable line(s) in applicable document(s) are not
essential components, and this information may be transmitted
to another computer (including a client). In addition, the
display of applicable line(s) in applicable document(s) may
show all character strings with the input similarity factor
or higher together with their similarity factors, document
numbers, position numbers in the document and the like, or may
show only a predetermined number of them. The order of display
may be the sequence of appearance in the document(s) or
the descending order of similarity factor. Moreover, in the case
of multiple documents, it may be possible to display a
predetermined number of character strings satisfying the
conditions in each document. This can be set variously in the
design stage.

While an approach has been described for use in the 
ambiguity search, the present invention may also be applied 
to spell checking of words in a document. In this case, a 
word in the document not existing in a dictionary is detected 
by the conventional approach. Then, the detected word not 
existing in the dictionary is used as a search character string, 
and the ambiguity search is performed for words existing in 
the dictionary. Then, words with a certain similarity factor or 
higher in the ambiguity search are displayed as candidates to 
correct the spelling for the word not existing in the dictio- 
nary. 
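
The spell checking flow can be sketched as follows. Note that this sketch does not implement the ambiguity search of the invention: the ratio of `difflib.SequenceMatcher` from the Python standard library is used purely as a stand-in similarity measure, and the function name `spelling_candidates`, the sample dictionary and the 0.80 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def spelling_candidates(word, dictionary, min_factor=0.80):
    """Return dictionary words similar to `word`, best match first."""
    if word in dictionary:
        return []  # the word exists in the dictionary; nothing to correct
    scored = [
        (SequenceMatcher(None, word, entry).ratio(), entry)
        for entry in dictionary
    ]
    return [entry for score, entry in sorted(scored, reverse=True)
            if score >= min_factor]


dictionary = ["communication", "computer", "data"]
print(spelling_candidates("comunication", dictionary))  # ['communication']
```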

Search for character strings consisting of only variable 
length chains has been described above. For a document in 
which variable length chains and fixed length chains are 
intermixed, the similarity factor for the entire character 
string is calculated in step 714 from the positional relation- 
ship between the position information for the fixed length 
chain and the position information for the variable length 
chain. This process is described by example. In the 
embodiment, there exists a search character string
"ASEAN123" while "ASEA012" exists in the document.
"ASEAN" and "ASEA" are variable length chains.
In the example, the contents of the character chain file 302
and the position information file 304 are:

Character chain file    Position information file
1. ASEA                 1-1
2. 01                   1-5
3. 12                   1-6

A variable length chain document character string "ASEA"
similar to the variable length chain search character string
"ASEAN" has been found by the above-mentioned method.
For this character string, detection of a valid matched
character string and calculation of similarity factor are
performed in a similar manner to the method described for
a fixed length chain.



EXAMPLE 8

TABLE 14

                                     12345678
Character string to be searched C:   ASEAN123
                                     1234567
Document D:                          ASEA012

TABLE 15

ASEAN123
1    2
ASEA012
1    2

Similar character string = "ASEA012"
Similarity factor = minimum (6/8, 6/7) = 0.75

As described for the calculation of the similarity factor for
a character string containing only variable length chains, the
result of calculation of the similarity factor for variable
length chains may be used for the calculation of the
similarity factor for the entire character string. In the simple
calculation of similarity factor, the similarity factor may be
calculated from the following equation.

EQUATION 12

Similarity factor = (number of characters in variable length
chain 1 × matching factor of the variable length chain 1 +
number of characters in valid matched character string of
fixed length chain + number of characters of delimiter) /
number of characters in search character string

Accordingly, the similarity factor of the search character
string in the embodiment is (5 × 0.80 + 2 + 0)/8 = 0.75, and the
similarity factor of the document character string is
(4 × 0.80 + 2 + 0)/7 = 0.74. Thus, the similarity factor of the
search character string is:

Similarity factor = minimum (0.75, 0.74) = 0.74
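
The two evaluations of Equation 12 and the final minimum can be reproduced with a few lines. The function name `equation_12` and its parameter names are hypothetical; the numbers are those of the embodiment.

```python
def equation_12(variable_chars, matching_factor, fixed_valid_chars,
                delimiter_chars, total_chars):
    """Similarity factor of one side (search string or document string)."""
    return (variable_chars * matching_factor
            + fixed_valid_chars + delimiter_chars) / total_chars


search_side = equation_12(5, 0.80, 2, 0, 8)    # "ASEAN123"
document_side = equation_12(4, 0.80, 2, 0, 7)  # "ASEA012"
similarity = min(search_side, document_side)

print(round(search_side, 2), round(document_side, 2), round(similarity, 2))
# 0.75 0.74 0.74
```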

In this calculation of similarity factor, the calculation may be
performed by changing weight for the variable length chain
and the fixed length chain. For example,

EQUATION 13

Similarity factor = (number of characters in variable length
chain 1 × matching factor of variable length chain 1 × 0.5 +
number of characters in valid matched character string of
fixed length chain + number of characters of delimiter × 0.2) /
(number of characters in search character string × 0.5 +
number of characters in fixed length chain in search
character string + number of characters of delimiter in search
character string × 0.2)

E9. Application to Search Equation

The above-mentioned search is one example for a fixed
character string. A case where it is applied to a search
equation is described. For example, in a search equation
such as

(computer OR system) AND communication

(a search equation searching for a document containing
"computer" or "system", and containing "communication"),
it is conceived to perform an ambiguity search by specifying
a search matching factor for each search character string.
When the search is performed for every search character
string with the matching factor of 80% or higher, for
example, the following documents may be found:

a document containing "computer" and
"communication", and

a document containing "sys-tem" and "communication"

In addition, when the found documents are arranged in
descending order of likelihood of being the desired
documents, the matching factor obtained as the result of the
search can be used as a cue.

E10. Relationship Between Structure of Index and Search
for "Similar Character String"

Ambiguity search for a "similar character string" can be
attained at a considerably high speed with the structure of an
index file according to the present invention by suitably
determining the value of M.

Determination of constants N and M

TABLE 16

N: Number of characters in character chain to be stored in
index
M: Shortest length of valid matched character string in
ambiguity search
L: Longest length of non-valid matched character string in
"similar character string" in ambiguity search

Although, if the values of N and N' are increased, the
number of types of character chains increases, the volume of
data per character chain decreases, and the search can be
performed at a higher speed, the capacity of the index file
increases. Sufficient search speed is obtained at N=2 and
N'=3 for average documents in Japanese, Chinese, Korean,
and English.

In addition, if M and M' are determined to be M≥N and
M'≥N', sufficient search speed is obtained in the ambiguity
search. In view of the fact that the smaller M and M' are, the
finer the search that can be attained, it is believed to be
desirable to set M=N and M'=N'.

E11. Second Embodiment for Determining Similarity Factor

The ambiguity search of the second embodiment is
particularly considered for equilibrium between the principles
that (1) "the more non-matched characters are
inserted, the less similarity one feels" and (2) "when too
many non-matched characters are inserted, then it cannot be
felt to be one character string". When a character string
matching an input character string, a non-matched character
string and a matched character string are arranged in a
document, it is unnatural that the degree of similarity is
lowered when the character strings up to the latter matched
character string are taken as a similar character string. For
example, when the input character string is "ABCD", and
the document 1 contains "ABXXXCD", and the document
2 contains "AB", "ABXXXCD" is a similar character
string in the document 1, and "AB" is that in the document
2. It would be unnatural to consider that the degree of
similarity is higher for "AB" (document 2) than
"ABXXXCD" (document 1), because document 1 contains
an additional matched character string of "CD". Therefore,
it would not be expected for document 1 to be evaluated to
have a lower degree of similarity. It is more natural to
consider that either the degree of similarity for
"ABXXXCD" is higher than "AB", or the similar character
strings in the document 1 are both "AB" and "CD".

Now, the process of the second embodiment is described.
Referring to the flowchart of FIG. 8, in this embodiment,
steps 602-612 are the same as before, and the process for
step 614 for indicating the conditions in searching the
(i+1)-th valid matched character string is changed as
follows:

EQUATION 14

s(C, i+1) > e(C, i) - (M - 1)    (Equation A)
s(D, i+1) > e(D, i)              (Equation B)

and

s(D, i+1) - e(D, i) - 1 + max(e(C, i) - s(C, i+1) + 1, 0) ≤ L    (Equation C)

Definitions of s(C, i), e(C, i), s(D, i), e(D, i) and the like
are the same as above.

Equation A means that duplicatively appearing characters
are allowed up to M-1 characters. Otherwise, character
strings appearing in the same order as that of characters in
the input character string are made valid.

Equation B means that valid matched character strings do
not overlap each other in the document.

Equation C means that inserted non-matched characters
and duplicatively appearing characters are allowed up to L
characters together.
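
The three conditions can be checked mechanically, as in the following sketch. It assumes the reading of Equations A, B and C given in the code comments, with s and e taken as 1-based inclusive start and end positions; the function name `next_match_is_valid` is hypothetical, and M = N = 2 and L = 3 are used as sample values.

```python
def next_match_is_valid(prev, nxt, M, L):
    """prev and nxt are (s_C, e_C, s_D, e_D) of the i-th and (i+1)-th strings."""
    _, e_c, _, e_d = prev
    s_c, _, s_d, _ = nxt
    # Equation A: s(C, i+1) > e(C, i) - (M - 1)
    eq_a = s_c > e_c - (M - 1)
    # Equation B: s(D, i+1) > e(D, i)
    eq_b = s_d > e_d
    # Equation C: s(D, i+1) - e(D, i) - 1 + max(e(C, i) - s(C, i+1) + 1, 0) <= L
    eq_c = (s_d - e_d - 1) + max(e_c - s_c + 1, 0) <= L
    return eq_a and eq_b and eq_c


M, L = 2, 3
# C = "ABCDEFG", D = "ABCCDEFG": "ABC" followed by "CDEFG" (one duplicated "C")
print(next_match_is_valid((1, 3, 1, 3), (3, 7, 4, 8), M, L))  # True
# C = "ABCD", D = "ABXXXXCD": "AB" followed by "CD" after four inserted characters
print(next_match_is_valid((1, 2, 1, 2), (3, 4, 7, 8), M, L))  # False
```

In the second case four non-matched characters are inserted, which exceeds L = 3, so Equation C is not satisfied.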

In this embodiment, instead of calculating the ratios
regarding the number of characters of valid matched
character strings and similar character strings in the document,
and selecting the smaller one as the similarity factor as in the
previous embodiment, the similarity factor is calculated by
giving marks to similar character strings and dividing these
assigned marks by the full mark (the mark when the string is
completely matched). The mark for a similar character string
is calculated by giving a mark to each character under the
following rule, and adding them. Accordingly, the process in
step 620 of FIG. 8 becomes as follows.

Character belonging to the first valid
matched character string                            1 point
Character belonging to the i-th (i > 1)
valid matched character string, and
  position in search character string ≥ e(C, i-1) + 1
                                                    1 point (Equation D)
  position in search character string ≤ e(C, i-1)
                                                    -1/(2*L) point (Equation E)
Character not belonging to valid matched
character string                                    -1/L point





Also in this embodiment, when the i-th similar character
string has been determined, the (i+1)-th similar character
string is determined by starting comparisons from the first
character after the top character in the i-th similar character
string and not belonging to the valid matched character
string constituting the i-th similar character string.

The negative point for the character not belonging to the
valid matched character string is set by taking into account
the equilibrium between "the more non-matched
characters are inserted, the less similarity one feels" and
"when too many non-matched characters are inserted, then it
cannot be felt to be one character string". The maximum total
of negative points for one non-matched character string is
1/L*L=1, and the minimum positive point is N≥1 when
taking in the next matched character string (2 is particularly
recommended for Japanese). Thus, the negative points never
exceed the positive points. In addition, Equation E
corresponds to point deductions for a duplicatively appearing
character, while Equation D indicates a simple matched
character, not a duplicatively appearing character. A case
where a character duplicatively appears is accommodated by
giving to a character expressed by Equation E a negative
mark smaller than that for a simple non-matched character.
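
The marking rule can be sketched as follows, assuming that each character of the similar character string has already been classified. The one-letter labels ("M" for a matched character worth 1 point, "E" for a duplicatively appearing character under Equation E, "X" for a non-matched character) and the function name `degree_of_similarity` are hypothetical.

```python
from fractions import Fraction


def degree_of_similarity(labels, full_mark, L):
    """Sum the per-character marks and divide by the full mark."""
    marks = {
        "M": Fraction(1),             # matched character (first string or Equation D)
        "E": Fraction(-1, 2 * L),     # duplicatively appearing character (Equation E)
        "X": Fraction(-1, L),         # character outside any valid matched string
    }
    return sum(marks[label] for label in labels) / full_mark


L = 3
# Input "ABCDEF" against "AB#CD#EF": six matched and two non-matched characters
print(degree_of_similarity("MMXMMXMM", 6, L))  # 8/9, about 0.889
# Input "ABCDEFG" against "ABCCDEFG": seven matched, one duplicated "C"
print(degree_of_similarity("MMMEMMMM", 7, L))  # 41/42, about 0.976
```

Exact fractions are used so that no rounding is introduced by the sketch itself; the figures quoted in the text for these two cases are 0.88 and 0.97.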
E12. Example of Determination of Similar Character String
and Degree of Similarity in the Second Embodiment

An example is shown also for N=2, and L=3.

TABLE 17

[Example 9]
                            123456
Input character string C:   ABCDEF
                            12345678
Part of document D: ...     AB#CD#EF ...

Since the first matched character string is "AB", the first
valid matched character string is "AB".

[Equation 15]
s(C, 1) = 1   e(C, 1) = 2
s(D, 1) = 1   e(D, 1) = 2

According to Equations A, B and C, the second valid
matched character string is "CD".

[Equation 16]
s(C, 2) = 3   e(C, 2) = 4
s(D, 2) = 4   e(D, 2) = 5

According to Equations A, B and C, the third valid
matched character string is "EF".

[Equation 17]
s(C, 3) = 5   e(C, 3) = 6
s(D, 3) = 7   e(D, 3) = 8

Since the end of the input character string is reached, the
valid matched character strings are three.

TABLE 18

C:       AB CD EF
         1  2  3
D:       AB#CD#EF
         1  2  3
Points:  1, 1, -1/3, 1, 1, -1/3, 1, 1

The similar character string is "AB#CD#EF" from s(D, 1)
to e(D, 3). The degree of similarity = ((1*6+(-1/3)*2)/6) =
0.88

TABLE 19

[Example 10]
                            1234
Input character string C:   ABCD
                            1234567891011121314
Part of document D: ...     ABXXXXCDXXXXXX ...

Since the first matched character string is "AB", the first
valid matched character string is "AB". Since the next
matched character string "CD" fails to satisfy Equation C,
the valid matched character string is only the first one.

TABLE 20

C:  ABCD
    1
D:  ABXXXXCDXXXXXX ...
    1

The similar character string is "AB". The degree of
similarity = 2/4 = 0.5

The first non-valid matched character after "A" is "X".
The second similar character string is searched after "X".

TABLE 21

C:  ABCD
      1
D:  ABXXXXCDXXXXXX ...
          1

Thus, the second similar character string is "CD".

TABLE 22

[Example 11]
                            1234567
Input character string C:   ABCDEFG
                            12345678
Part of document D: ...     ABCCDEFG ...

The valid matched character strings are two of "ABC" and
"CDEFG".

TABLE 23

C:       ABCDEFG
         1 2
D:       ABC CDEFG
         1   2
Points:  1, 1, 1, -1/6, 1, 1, 1, 1

The similar character string is "ABCCDEFG", and the
second "C" satisfies Equation E. Thus, the degree of
similarity = ((1*7+(-1/6)*1)/7) = 0.97.

While the ambiguity search according to the second
preferred embodiment of the present invention has been
described for the calculation of a similarity factor for fixed
length chains, it will be easily understood by those skilled in
the art that it can be applied to a character string containing
variable length chains.

As described above, according to the present invention,
document search can be performed in response to a vague
request of the user for document search. In addition, since
the present invention provides a character string search
method for extracting a unique character string without
using vocabulary information or grammatical information,
there is no need for maintenance of the vocabulary
information or grammatical information, so that a search system
that can consistently handle new words or phrases can be
provided. Furthermore, since the present invention provides
a character string search technique for relatively and
dynamically extracting a unique character string, there is
provided the advantage of being able to extract character
strings which are unique to an input document, rather than
to extract similar unique character strings from documents
of a special type.
What is claimed is: 

1. A method for identifying at least one unique character 
string in an input document which is input into a computer 
system, said computer system being operable to search one 
or more documents which are searchably stored in a storage 
medium, and said unique character string is used as a search 
string, the method comprising: 

associating and managing position information for a posi- 
tion in said searchably stored documents where one or 
more partial comparison document character strings are 
extracted from said searchably stored documents; 

extracting a partial input character string from said input 
document, and determining whether said partial input 
character string is a candidate character string; 

identifying a partial comparison document character 
string which matches at least a part of said candidate 
character string with a predetermined similarity factor 
or higher; 

identifying position data associated with said partial com- 
parison document character string which matches with 
said predetermined similarity factor or higher; and 

recognizing said candidate character string as the unique 
character string by comparing appearance frequency 
information of at least a part of said candidate character 
string appearing in said input document with the posi- 
tion data and evaluating an amount of feature of said 
candidate character string. 

2. A method for identifying at least one unique character 
string as said search string according to claim 1, wherein in 
recognizing said candidate character string as the unique 
character string, said amount of feature is a point allocation 
for said candidate character string and points are allocated 
according to the equation: 

\[
\sum_{j=1}^{\mathrm{Nnum}\,i} (\mathrm{Ncount}\,i\,j / \mathrm{Nsize}\,i\,j) / \mathrm{Nnum}\,i \times \mathrm{count}\,i \times \mathrm{Max}(\mathrm{Nsize}\,i\,j) / \mathrm{Max}(\mathrm{Ncount}\,i\,j)
\]

where:

count i=appearance frequency of an i-th candidate
character string in the input document;

Ncount i j=appearance frequency of a j-th partial input
character string of the i-th candidate character string in
the input document;

Nsize i j=position data corresponding to the j-th partial
input character string of the i-th candidate character
string;

Nnum i=a number of partial input character strings
contained in an i-th candidate character string;

Max (Nsize i j)=maximum Nsize i j for all searchably
stored documents;

Max (Ncount i j)=maximum Ncount i j for all searchably
stored documents.

3. A method for searching for a document to be searched
from one or more documents searchably stored in a
computer, said document to be searched having a character
string similar to a partial input character string existing in an
input document input in the computer, the method
comprising:

extracting a partial character string from said input
document, and determining whether said partial
character string is a candidate character string;

evaluating an amount of feature of said candidate
character string through comparison between appearance
frequency information of at least a part of said
candidate character string appearing in said input document
and appearance frequency information of at least a part
of said candidate character string appearing in said
searchably stored documents to recognize said
candidate character string as a unique character string; and

searching for said document to be searched from said
searchably stored documents, wherein said document
to be searched has a character string similar to said
unique character string.

4. The method for searching for a document according to 
claim 3, wherein in evaluating the amount of feature of said 
candidate character string, said amount of feature is a point 
allocation for said candidate character string and points are 
allocated according to the equation: 

\[
\sum_{j=1}^{\mathrm{Nnum}\,i} (\mathrm{Ncount}\,i\,j / \mathrm{Nsize}\,i\,j) / \mathrm{Nnum}\,i \times \mathrm{count}\,i \times \mathrm{Max}(\mathrm{Nsize}\,i\,j) / \mathrm{Max}(\mathrm{Ncount}\,i\,j)
\]

where:

count i=appearance frequency of an i-th candidate
character string in the input document;

Ncount i j=appearance frequency of a j-th partial input
character string of the i-th candidate character string in
the input document;

Nsize i j=position data corresponding to the j-th partial
input character string of the i-th candidate character
string;

Nnum i=a number of partial input character strings
contained in an i-th candidate character string;

Max (Nsize i j)=maximum Nsize i j for all searchably
stored documents;

Max (Ncount i j)=maximum Ncount i j for all searchably
stored documents.

5. A method for identifying at least one unique character
string in an input document which is input into a computer
system, said computer system being operable to search one
or more documents stored in a storage medium, and using
said unique character is used as a search string, the method
comprising:

extracting a partial input character string from said input
document, and determining whether said partial input
character string is a candidate character string; and

evaluating an amount of feature of said candidate
character string through comparison between appearance
frequency information of at least a part of said
candidate character string appearing in said input document
and appearance frequency information of at least a part
of said candidate character string appearing in said
searchably stored documents to recognize said
candidate character string as said unique character string.

6. A method for identifying at least one unique character
string as the search string of claim 5, wherein in evaluating
the amount of feature, said amount of feature is a point
allocation for said candidate character string and points are
allocated according to the equation:

\[
\sum_{j=1}^{\mathrm{Nnum}\,i} (\mathrm{Ncount}\,i\,j / \mathrm{Nsize}\,i\,j) / \mathrm{Nnum}\,i \times \mathrm{count}\,i \times \mathrm{Max}(\mathrm{Nsize}\,i\,j) / \mathrm{Max}(\mathrm{Ncount}\,i\,j)
\]

where:

count i=appearance frequency of an i-th candidate
character string in the input document;

Ncount i j=appearance frequency of a j-th partial input
character string of the i-th candidate character string in
the input document;

Nsize i j=position data corresponding to the j-th partial
input character string of the i-th candidate character
string;

Nnum i=a number of partial input character strings
contained in an i-th candidate character string;

Max (Nsize i j)=maximum Nsize i j for all searchably
stored documents;

Max (Ncount i j)=maximum Ncount i j for all searchably
stored documents.

7. A method for evaluating similarity between a compari- 
son document and an input document which contains a first 
unique character string and a second unique character string 
input in a computer, said computer system being operable to 
search for said comparison document stored in a storage 
medium, the method comprising: 

calculating a first weight value corresponding to said first 
unique character string from appearance frequency 
information of at least a part of said first unique 
character string of said input document; 

calculating a second weight value corresponding to said 
second unique character string from appearance fre- 
quency information of at least a part of said second 
unique character string of said input document; 

calculating a first appearance frequency value of at least 
a part of said first unique character string appearing in 
said comparison document; 

calculating a second appearance frequency value of at 
least a part of said second unique character string 
appearing in said comparison document; and 

calculating a similarity factor between said input docu- 
ment and said comparison document from the first 
appearance frequency value taking said first weight 
value into account and the second appearance fre- 
quency value taking said second weight value into 
account. 

8. A method for evaluating similarity between a compari- 
son document and a unique character string input in a 
computer system, said computer system being operable to 
search for said comparison document stored in a storage 
medium, the method comprising: 

calculating a weight value corresponding to said unique 
character string from appearance frequency informa- 
tion of at least a part of said unique character string 
appearing in an input document; and 

calculating a similarity factor between said unique
character string and said comparison document from the
appearance frequency information of at least a part of
said unique character string appearing in said
comparison document and said weight value.

9. An apparatus for identifying at least one unique char- 
acter string in an input document which is input into a 

5 computer system, said computer system containing one or 
more documents which are searchably stored by the 
computer, the apparatus comprising: 

a storage device for storing a position information file
which associates and manages position information for
a position in said searchably stored documents where
one or more partial comparison document character
strings are extracted from said searchably stored
documents;

means for extracting a candidate character string from
said input document;

means for identifying a partial comparison document
character string which matches part of said candidate
character string with a predetermined similarity factor
or higher;

means for identifying position data which is associated to 
said partial comparison document character string hav- 
ing the predetermined similarity factor or higher in said 
position information file; and 

means for recognizing said candidate character string as
the unique character string by comparing appearance
frequency information of at least a part of said candidate
character string appearing in said input document
with said position information, and evaluating an
amount of feature of said candidate character string.
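
The position information file of claim 9 can be pictured as an index from partial character strings to the places where they occur in the stored documents. The sketch below is one possible layout, not the patented structure: it assumes fixed-length partial strings (character n-grams with a hypothetical default of `n = 2`) and exact matching, whereas the claim allows matching "with a predetermined similarity factor or higher". The names `build_position_index` and `position_data` are hypothetical.

```python
from collections import defaultdict


def build_position_index(documents: dict, n: int = 2) -> dict:
    """Map each n-character partial string to (document id, offset) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos in range(len(text) - n + 1):
            index[text[pos:pos + n]].append((doc_id, pos))
    return dict(index)


def position_data(index: dict, partial: str) -> list:
    """Return the stored positions for a partial string (exact match assumed)."""
    return index.get(partial, [])
```

For example, `len(position_data(index, "ab"))` gives the number of recorded positions for the partial string "ab", a quantity that can then be compared with that string's appearance frequency in the input document.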

10. The apparatus for identifying at least one unique 
character string in an input document according to claim 9, 
wherein in said means for recognizing the candidate string 
as the unique string, said amount of feature is a point
allocation for said candidate character string and points are
allocated according to the equation:

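$$\left[\sum_{j=1}^{\mathrm{Nnum}_i}\left(\mathrm{Ncount}_{ij}/\mathrm{Nsize}_{ij}\right)\right]/\mathrm{Nnum}_i \times \mathrm{count}_i \times \mathrm{Max}(\mathrm{Nsize}_{ij})/\mathrm{Max}(\mathrm{Ncount}_{ij})$$
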

where: 

count i=appearance frequency of an i-th candidate
character string in the input document;

Ncount i j=appearance frequency of a j-th partial input 
character string of the i-th candidate character string in 
the input document; 

Nsize i j=position data corresponding to the j-th partial 
input character string of the i-th candidate character 
string; 

Nnum i=a number of partial input character strings
contained in an i-th candidate character string;

Max (Nsize i j)=maximum Nsize i j for all searchably
stored documents;

Max (Ncount i j)=maximum Ncount i j for all searchably
stored documents.
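
The point allocation recited in claim 10 can be computed directly from the quantities defined above. The sketch below is illustrative only: the function name `point_allocation` and its parameter names are hypothetical, and it assumes that every Nsize value and the maximum Ncount are positive, a case the claim text does not address.

```python
def point_allocation(count_i, ncount, nsize, max_nsize, max_ncount):
    """Points for the i-th candidate character string.

    count_i    -- appearance frequency of the candidate in the input document
    ncount     -- Ncount i j for each partial input character string j
    nsize      -- Nsize i j (position data) for each partial string j
    max_nsize  -- Max(Nsize i j) over all searchably stored documents
    max_ncount -- Max(Ncount i j) over all searchably stored documents
    """
    if not ncount or len(ncount) != len(nsize):
        raise ValueError("ncount and nsize must be non-empty and equal in length")
    if max_ncount <= 0 or any(s <= 0 for s in nsize):
        # Assumption of this sketch: zero-valued denominators are rejected.
        raise ValueError("Nsize values and Max(Ncount) must be positive")
    nnum = len(ncount)  # Nnum i: number of partial strings in the candidate
    mean_ratio = sum(c / s for c, s in zip(ncount, nsize)) / nnum
    return mean_ratio * count_i * max_nsize / max_ncount
```

With `count_i = 3`, `ncount = [3, 4]`, `nsize = [10, 20]`, `max_nsize = 100` and `max_ncount = 5`, the mean ratio is (0.3 + 0.2) / 2 = 0.25, and the allocation is 0.25 × 3 × 100 / 5 = 15.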

11. An apparatus for searching for a document to be
searched from one or more documents searchably stored in
a computer, said document to be searched having a character
string similar to a partial input character string which exists
in an input document input in the computer, the apparatus
comprising:

an input device for identifying said input document and 
instructing execution of a search;

means for detecting from said input device that said input
document is identified and that said instruction of a 
search is input; 

means for extracting a candidate character string from 
said input document in response to the detection that 
said input document is identified and that said instruc- 
tion of a search is input; 

means for calculating an amount of feature of said can- 
didate character string through comparison between 
appearance frequency information of at least a part of said
candidate character string appearing in said input docu- 
ment and appearance frequency information of at least 
a part of said candidate character string appearing in 
said searchably stored documents; 

means for determining whether said candidate character 
string is a unique character string by evaluating said 
amount of feature;

means for searching for the document to be searched from 
said searchably stored documents, wherein said docu- 
ment to be searched has a character string similar to 
said unique character string; and 

a display device for displaying the document to be 
searched having a character string similar to said 
unique character string. 

12. An apparatus for searching for a document according
to claim 11, wherein in said means for determining the unique
character string, said amount of feature is a point allocation
for said candidate character string and points are allocated
according to the equation:

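$$\left[\sum_{j=1}^{\mathrm{Nnum}_i}\left(\mathrm{Ncount}_{ij}/\mathrm{Nsize}_{ij}\right)\right]/\mathrm{Nnum}_i \times \mathrm{count}_i \times \mathrm{Max}(\mathrm{Nsize}_{ij})/\mathrm{Max}(\mathrm{Ncount}_{ij})$$
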

where: 

count i=appearance frequency of an i-th candidate char- 
acter string in the input document; 

Ncount i j=appearance frequency of a j-th partial input
character string of the i-th candidate character string in
the input document;

Nsize i j=position data corresponding to the j-th partial
input character string of the i-th candidate character
string;

Nnum i=a number of partial input character strings
contained in an i-th candidate character string;

Max (Nsize i j)=maximum Nsize i j for all searchably
stored documents;

Max (Ncount i j)=maximum Ncount i j for all searchably
stored documents.

13. An apparatus for identifying at least one unique 
character string in an input document which is input into a 
computer system, said computer system containing one or 
more documents which are searchably stored by the 
computer, and said unique character string is used as a 
search string, the apparatus comprising: 

means for extracting a candidate character string from
said input document; and

means for determining whether said candidate character 
string is a unique character string by evaluating an 
amount of feature of said candidate character string 
through comparison between appearance frequency 
information of at least a part of said candidate character 
string appearing in said input document and appearance
frequency information of at least a part of said candidate
character string appearing in said searchably
stored documents. 

14. An apparatus for evaluating similarity between a
comparison document and an input document containing a
unique character string input into a computer system, said
computer system containing a comparison document
searchably stored by the computer, the apparatus comprising:

means for calculating a weight value corresponding to
said unique character string from appearance frequency
information of at least a part of said unique character
string appearing in said input document; and

means for calculating a similarity factor between said
input document and said comparison document from
the appearance frequency information of at least a part
of said unique character string appearing in said
comparison document and said weight value.

15. A storage medium readable by a computer for storing
a program operable to identify a document input into a
computer system based on an input document, said
computer system containing one or more documents which are
searchably stored by the computer, the program comprising:

program code means for directing said computer to
extract a partial character string from said input
document and determining whether the partial character
string is a candidate character string; and

program code means for directing said computer to
determine whether the candidate character string is a unique
character string by evaluating an amount of feature of
said candidate character string through comparison
between appearance frequency information of at least a
part of said candidate character string appearing in said
input document and appearance frequency information
of at least a part of said candidate character string
appearing in said searchably stored documents.

16. The storage medium readable by a computer for 
storing a program operable to identify a document according 
to claim 15, wherein in said program code means for 
determining whether the candidate character string is a
unique character string, said amount of feature is a point
allocation for said candidate character string and points are
allocated according to the equation:

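$$\left[\sum_{j=1}^{\mathrm{Nnum}_i}\left(\mathrm{Ncount}_{ij}/\mathrm{Nsize}_{ij}\right)\right]/\mathrm{Nnum}_i \times \mathrm{count}_i \times \mathrm{Max}(\mathrm{Nsize}_{ij})/\mathrm{Max}(\mathrm{Ncount}_{ij})$$
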

where: 

count i=appearance frequency of an i-th candidate char- 
acter string in the input document; 

Ncount i j=appearance frequency of a j-th partial input
character string of the i-th candidate character string in
the input document;

Nsize i j=position data corresponding to the j-th partial 
input character string of the i-th candidate character 
string; 

Nnum i=a number of partial input character strings
contained in an i-th candidate character string;

Max (Nsize i j)=maximum Nsize i j for all searchably
stored documents;

Max (Ncount i j)=maximum Ncount i j for all searchably
stored documents.

17. The storage medium readable by a computer according
to claim 15 further comprising:

program code means for searching for a document to be
searched from said searchably stored documents, 
wherein said document to be searched has a character 
string similar to said unique character string. 

18. A storage medium readable by a computer for storing 
a program which is operable to evaluate similarity between 
a comparison document and an input document containing a 
unique character string input into a computer system, said 
comparison document being searchably stored by the 
computer, the program comprising: 

program code means for directing said computer to cal- 
culate a weight value corresponding to said unique 
character string from appearance frequency informa- 
tion of at least a part of said unique character string 
appearing in said input document; and 

program code means for directing said computer to
calculate a similarity factor between said input document
and said comparison document from the appearance
frequency information of at least a part of said unique
character string appearing in said comparison document
and the weight value.

19. A medium readable by a computer for storing a 
program operable to identify at least one unique character 
string in an input document which is input into a computer 
system, said computer system being operable to search one 
or more documents which are searchably stored in a storage 
medium, and said unique character string is used as a search 
string, the program comprising: 



program code means for associating and managing posi- 
tion information for a position in said searchably stored 
documents where one or more partial comparison docu- 
ment character strings are extracted from said search- 
ably stored documents; 

program code means for extracting a partial input char- 
acter string from said input document, and determining 
whether said partial input character string is a candidate 
character string; 

program code means for identifying a partial comparison 
document character string which matches at least a part 
of said candidate character string with a predetermined 
similarity factor or higher; 

program code means for identifying position data asso- 
ciated with said partial comparison document character 
string which matches with said predetermined similar- 
ity factor or higher; and 

program code means for recognizing said candidate char- 
acter string as the unique character string by comparing 
appearance frequency information of at least a part of 
said candidate character string appearing in said input 
document with the position data and evaluating an 
amount of feature of said candidate character string.